Functional divergence of duplicate genes several million years after gene duplication in Arabidopsis

Functional divergence of duplicate genes several million years after gene duplication in Arabidopsis Lineage-specific duplicated genes likely contribute to the phenotypic divergence in closely re- lated species. However, neither the frequency of duplication events nor the degree of selection pressures immediately after gene duplication is clear in the speciation process. Here, using Illumina DNA-sequencing reads from Arabidopsis halleri, which has multiple closely related spe- cies with high-quality genome assemblies (A. thaliana and A. lyrata), we succeeded in generat- ing orthologous gene groups in Brassicaceae. The duplication frequency of retained genes in the Arabidopsis lineage was 10 times higher than the duplication frequency inferred by compara- tive genomics of Arabidopsis, poplar, rice and moss (Physcomitrella patens). The difference of duplication frequencies can be explained by a rapid decay of anciently duplicated genes. To ex- amine the degree of selection pressure on genes duplicated in either the A. halleri-lyrata or the A. halleri lineage, we examined positive and purifying selection in the A. halleri-lyrata and A. halleri lineages throughout the ratios of nonsynonymous to synonymous substitution rates (K / K ). Duplicate genes tended to have a higher proportion of positive selection compared with non-duplicated genes. Interestingly, we found that functional divergence of duplicated genes was accelerated several million years after gene duplication compared with immediately after gene duplication. Key words: plant evolution, gene duplication, Arabidopsis, selection pressure, functional divergence V C The Author 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 1 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 2 Functional divergence after gene duplication 1. Introduction collected the plants in Mt. Ibuki, Shiga, Japan. We then extracted the plant DNAs, and performed the paired-end Illumina DNA sequencing. Plant genomes have experienced more gene duplication events than After generating contigs from the Illumina DNA-sequencing reads, we most other eukaryotes. These duplications are largely classified into inferred A. halleri orthologous genes against the A. thaliana and whole genome duplications (WGDs) and small-scale duplications A. lyrata genes. The number of SSDs was inferred both from phyloge- (SSDs). WGD events are concentrated in the Cretaceous-Paleogene netic analysis of orthologous genes and the reading depth of Illumina extinction period. SSDs such as retroduplication and tandem dupli- DNA sequencing. As outgroup species for the three Arabidopsis species, cation have occurred continuously at a high rate during plant evolu- we used five non-Arabidopsis species in Brassicaceae for which genome tion, indicating that SSDs may be a key factor for plant speciation or 3–8 sequences are already determined. There is trans-specific polymorphism phenotypic differences in close relatives. SSDs tend to be lineage- in Arabidopsis; in particular, some genes have undergone gene flow in specific in land plants. There is a clear functional bias between 4,9 Arabidopsis. Here, excluding genes that have undergone gene flow in WGD and SSD in plants. SSDs tend to be associated with stress re- 4,8–10 Arabidopsis, we generated three sets of orthologous gene groups sponses, and are likely important for adaptive evolution to rap- (OGGs) among five non-Arabidopsis species, A. thaliana, A. lyrata and idly changing environments in close relatives. A. halleri in Brassicaceae (Fig. 1). Each OGG represent genes derived SSD genes functionally diverge through either neo-functionalization 11,12 from one ancestral gene in each speciation event. By identifying the or sub-functionalization, which may contribute to speciation. SSD genes in each OGG, we examined the evolutionary process in the genes can also retain functional redundancy through the long-term fate 13,14 15,16 Arabidopsis lineage. Although a BAC library-based genome for A. hal- of becoming pseudogenes, dosage selection or avoiding devel- 17–19 leri is unavailable, there is an available A. halleri genome based on opmental errors. Functional divergence of duplicated genes can be long-insert mate paired reads and short-insert paired-end reads in inferred by selection pressures based on the non-synonymous substitu- 38,39 Illumina DNA sequencing. The available A. halleri genome was tion rate (K ) versus the synonymous substitution rate (K ). Genes un- A S originated from A. halleri subsp. gemmifera which was the same sub- dergoing positive selection (K /K > 1) and purifying selection (K /K A S A S species in our plant material. The collection site was in Tada mine, < 1) are associated with functional divergence and constraint, respec- Hyogo, Japan. Microsatellite analyses suggested that our plant material tively. Most earlier studies compared selection pressures in WGDs collected from Mt. Ibuki, Shiga, Japan was genetically closed to the with those in SSDs, or selection pressures in anciently duplicated genes 21–24 plant with the available genome. Thus, it is likely that A. halleri genes (several hundred million years [MY] ago) in plants. There are little inferred from the available genome are similar to our inferred A. halleri data with which to examine selection pressures in recent SSDs in plants. genes. To validate our overall results, we also examined SSDs based on Because selection pressure likely varies during the speciation process, it the available A. halleri genome, is important to examine the selection pressures immediately after gene In a previous report, selection pressures of genes duplicated after duplications among close relatives. the divergence of A. thaliana and A. lyrata were examined in the A. Recently, there are many plant genomes assembled by Illumina thaliana lineage. These lineage-specific duplicated genes tend to DNA-sequencing reads. However, SSDs tend to be underestimated 26,27 have undergone positive selection. Here, we examined the selection in contigs generated by only Illumina DNA-sequencing. It is pressures in duplicated genes in the other lineage after the split be- likely that genomes generated by BAC are useful to examine SSDs. tween A. thaliana and A. lyrata. This lineage was further split into However, construction of such a library takes much budget and time the A. halleri–lyrata lineage and A. halleri lineage (Fig. 2). We then to generate highly qualified genomes. In the present study, we tried examined whether gene duplications in either the A. halleri–lyrata or to generate reliable orthologous genes by only paired-end Illumina A. halleri lineage contributed to functional divergence in the retained DNA-sequencing reads using available genomes of close relatives. duplicate genes. We then inferred SSDs in a species after the divergence of closed spe- Duplicated genes may have undergone purifying selection in the cies throughout the reading depth of Illumina DNA sequencing. Arabidopsis lineage. The gene dosage hypothesis proposes that du- The Arabidopsis lineage seems to be the best lineage to examine re- plicate genes can be fixed for increased gene dosage, keeping redun- cent SSDs in plants because Arabidopsis has two BAC library-based ge- 28,29 dant functions among duplicated genes. In A. halleri, the expression nomes in Arabidopsis thaliana and Arabidopsis lyrata. In of HEAVY METAL ATPASE 4 (HMA4) has been enhanced by cis- particular, A. thaliana has a large amount of functional data for its an- regulatory mutations and increased gene copy number for metal notated genes. To comprehensively examine SSDs in Arabidopsis,we in- hyperaccumulation. This is an example of increased gene dosage vestigated an additional Arabidopsis species—A. halleri. The divergence 30 by gene duplication. Consequently, duplicated genes with purifying time between A. lyrata and A. halleri is supposed to be 2MYA, and selection tend to be associated with increased gene dosage. the divergence time between either A. halleri or A. lyrata and A. thaliana 31 To understand the contribution of duplicated genes to the specia- is 10–11 MYA. Among recently diverged three Arabidopsis species, tion process in Arabidopsis within the last 10 MY, we first examined there are phenotypic variations such as self-compatibility and heavy- the frequency of gene duplications in the A. halleri–lyrata and A. hal- metal tolerance. A. thaliana has self-compatibility, but the others are 32–34 leri lineages. We then examined the relationship between gene dupli- self-incompatible. Therefore, A. halleri and A. lyrata have large pet- cation and selection pressure based on the K /K ratio. Finally, we A S als to attract pollinator insects, and the anthers are separated from the examined the over-represented functional categories of genes under stigma to avoid autopollination. A. halleri is known as a Zn/Cd positive and purifying selection. 33,34 hyper-accumulator andsome wildpopulations of A. lyrata are toler- ant of serpentine soils, which are characterized by a high heavy-metal content, while A. thaliana is not. 2. Materials and methods A. halleri has highly variable morphologies with respect to leaf, 2.1. Sampling and illumina paired-end flower colour and development of stolons among the subspecies. To ex- amine SSDs in the Arabidopsis lineage, we focused on A. halleri subsp. DNA-sequencing analysis gemmifera which grows in the Russian Far East, northeastern China, Arabidopsis halleri subsp. gemmifera is one of the three subspecies of Korea, Taiwan and Japan across lowland and highland areas. We A. halleri, and grows in the Russian Far East, northeastern China, Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 3 nonA-AT-AL-AH OGGs AT-AL-AH OGGs AL-AH OGGs B. rapa, B. stricta C. grandiflowa A. halleri (AH) A. thaliana (AT) A. lyrata (AL) C. rubella, E. salsugineum (non-A) Figure 1. Three sets of OGGs among non-Arabidopsis species, A. thaliana, A. lyrata and A. halleri. OGGs between A. lyrata and A. halleri were defined as AL–AH OGGs. There were 25,833, 26,428 and 26,007 AL–AH OGGs based on Illumina paired-end DNA-sequencing reads without any pseudogene-like genes, Illumina paired-end DNA-sequencing reads including pseudogene-like genes and the available A. halleri genome, respectively. OGGs among A. thaliana, A. lyrata and A. halleri were defined as AT–AL–AH OGGs. There were 22,105, 22,684 and 21,537 AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads with- out any pseudogene-like genes, Illumina paired-end DNA-sequencing reads including pseudogene-like genes and the available A. halleri genome, respectively. OGGs among non-Arabidopsis species (B. rapa, B. stricta, C. grandiflora, C. rubella, E. salsugineum), A. thaliana, A. lyrata and A. halleri were defined as nonA- AT–AL–AH OGGs. There were 17,669 and 17,925 non-A-AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads without any pseudogene-like genes and the available A. halleri genome, respectively. 450-700 mya -3 1.8-3.5 x 10 (gains/gene/my) 200 mya -3 2.0-2.5 x 10 (gains/gene/my) 100-120 mya -3 2.5-3.0 x 10 (gains/gene/my) 33-36 mya A. halleri-lyrata lineage 10-11mya -2 (gains/gene/my) 1.6-2.0 x 10 2 mya A. halleri lineage -2 5.8-7.6 x 10 (gains/gene/my) P. patens O.sativa P.trichocarpa B. rapa A. thaliana A.lyrata A.halleri Figure 2. Gain rates through gene duplication in land plants. The divergence times among moss (P. patens), rice (O. sativa), poplar (P. trichocarpa), B. rapa, 30,31,72–75 A. thaliana, A. lyrata and A. halleri were taken from previous reports. Gene gains through gene duplication were estimated for each branch. The gene gains in two branches were estimated in this study. One was the A. halleri-lyrata lineage, which represents the evolutionary process of the A. halleri lineage be- tween the divergence of A. thaliana and the divergence of A. lyrata. The other was the A. halleri lineage, which represents the evolutionary process of the A. hal- leri lineage after the divergence of A. lyrata. Gene gains were divided by ancestral gene numbers and branch lengths corresponding to times (MY). The rate of gene gain was defined as the number of genes gained through gene duplication per gene per MY. Korea, Taiwan and Japan. In 2008, a leaf sample was collected nuclear genome, sequencing reads mapped to either the mitochon- from an individual of A. halleri subsp. gemmifera on Mt. Ibuki, drial or chloroplast genome in A. thaliana (https://www.arabidopsis. Shiga, Japan. DNA was extracted from the leaf using a DNeasy org, TAIR10) or A. lyrata (http://genome.jgi.doe.gov, Filtered Plant Mini Kit (QIAGEN, Venlo, The Netherlands). A 300-bp Models6) by BLASTN version 2.2.29 (e-value < 1.0) were excluded paired-end DNA library was constructed according to the paired- from the following procedures. Assuming that the genome size of End Genomic DNA Sample Preparation Kit (Illumina, San Diego, A. halleri was 255 Mb, the sequencing coverage was estimated to CA, USA), and 93-bp paired-end reads were obtained from the be 135 (34.4/0.255 Gb). Illumina GAIIx sequencer. A total of 44.5 Gb reads were generated by Illumina DNA paired- 2.2. Gene sequences of Brassica rapa, Boechera stricta, end sequencing. Using the Trim Galore (www.bioinformatics.babra Capsella grandiflora, Capsella rubella, Eutrema ham.ac.uk) software, sequences with low-quality base calls (Phred score < 20) were trimmed off from the 3 end of the reads. Adapter salsugineum, A. thaliana and A. lyrata sequences started with ‘AGATCGGAAGAGC’ were completely re- Gene sequences in B. rapa (BrapaFPsc_277_v1.3), B. stricta moved from the reads. Approximately 28% of reads had either low- (Bstricta_278_v1.2), C. grandiflora (Cgrandiflora_266_v1.1), C. ru- quality scores or adapters and were trimmed by Trim Galore. When bella (Crubella_183_v1.0) and E. salsugineum (Esalsugineum a paired-end read was completely removed by this procedure, the _173_v1.0) were collected from Phytozome (https://phytozome.jgi. other read was used as a single-end read. Given that mitochondrial doe.gov, version 10). Gene sequences in A. thaliana and A. lyrata and chloroplast genomes have much higher copy numbers than the were collected from TAIR (https://www.arabidopsis.org/, version 10, Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 4 Functional divergence after gene duplication TAIR10_cds_20110103) and JGI (http://genome.jgi.doe.gov, mapped by BOWTIE2 version 2.2.3. Reads per kilobase of exon FilteredModels6), respectively. model per million (RPKM) values were calculated for each A. halleri gene. For 12 A. halleri genes whose copy numbers were known (Supplementary Table S2), we designed primer pairs using the fol- 2.3. Assembly of A. halleri genes orthologous to lowing parameters in the Primer3Plus software: primer length of A. lyrata genes based on illumina paired-end 18–24 bases, amplicon length of 70–150 bp, T value of 57–63 C DNA-sequencing reads and GC composition of 40–60%. To obtain enough genomic DNA We first generated A. halleri contigs from a total of 44.5 Gb Illumina for droplet digital PCR (ddPCR), we mixed the genomic DNAs from DNA-sequencing reads using several assembly software tools such as four individuals of A. halleri from Mt. Ibuki. The genomic DNA was 45 46 47 Platanus 1.2.4, ABySS 1.5.2, SOAP-denovo2 and CLC sonicated with the Covaris M220 system (25 s in frequency sweeping Genomics Workbench 7.0.3 (https://www.qiagenbioinformatics. mode). The concentration of the sonicated genomic DNA sample com/) software. A higher proportion of genes in closely related spe- was 6 ng/ll. The peak size of sonicated DNA fragments was 2,382 cies tend to be mapped to reliable assembled sequences. As the clos- bp according to the Fragment Analyzer system (Advanced est species, 32,670 annotated A. lyrata genes were mapped to each Analytical, Ankeny, IA, USA) with the High Sensitivity Genomic type of assembled DNA segment by the gmap (version 2014-08-20) DNA Analysis Kit (Advanced Analytical). Each ddPCR reaction was software with default parameters, which takes into account exon–in- performed with ddPCR EvaGreen Supermix (Bio-Rad, Hercules, CA, tron junctions. The A. halleri contigs generated by the ABySS soft- USA). Each reagent was divided into 20,000 droplets with a drop- ware with 63 K-mer size (N50 size ¼ 5,331 bp) had the highest let generator (Bio-Rad QX-200). Cycled droplets were measured mapping rate to the 32,670 annotated A. lyrata genes with >80% with a QX200 droplet reader (Bio-Rad). coverage (Supplementary Table S1). Some of the matched regions To find A. halleri genes whose DNA concentrations were robustly were truncated by terminal codons. When coding sequences up to inferred by ddPCR, we first identified uniquely mapped regions (>80 the terminal codon had >80% similarity to and 80% coverage for bp) in A. halleri genes from the Illumina DNA-sequencing reads. an A. lyrata gene, the coding sequence was defined as the ortholo- Among the A. halleri genes with uniquely mapped regions, we manu- gous A. halleri gene sequence to an A. lyrata gene. Following this ally chose 50 A. halleri genes with a variety of RPKMs. We designed procedure, we generated 22,727 orthologous gene pairs between A. a pair of primers for each gene in the Primer3Plus software using the halleri and A. lyrata. However, an A. lyrata gene sequence was fre- parameters described earlier. To examine the specificity of the tar- quently mapped to different A. halleri contigs depending on the re- geted DNAs, we performed real-time PCR analysis using the proto- gion of the A. lyrata gene because the whole sequence of each A. col for the Mx3000P qPCR System (Agilent Technologies). The PCR halleri gene was split into several contigs. To further identify A. hal- analyses were performed using SsoFast EvaGreen Supermix (Bio- leri genes orthologous to A. lyrata genes, we re-assembled the A. hal- Rad) and the products were analysed using the Mx3000P multiplex leri contigs based on the mapping information of A. lyrata genes. To quantitative PCR system (Agilent Technologies). Specific and multi- collect additional A. halleri genes orthologous to A. lyrata, ple reactions should result in a single and multiple melting peaks cor- unmapped A. lyrata genes were re-mapped to the assembled DNA responding to the PCR products. Of the 50 primer pairs, 25 showed 5 43 segments by TBLASTN version 2.2.29 (e-value < 1  10 ). a clear raised curve for all three replicates. Thus, for the copy number However, the contigs mapped separately by A. lyrata genes may not assay by ddPCR, we used 13 primer pairs for unknown copy number be orthologous syntenic regions. To focus on only syntenic contigs genes, 10 pairs for single-copy genes in a broad range of plant species between A. halleri and A. lyrata, we defined syntenic contigs to and 2 pairs for a three-copy gene (Supplementary Table S2). which more than two A. lyrata genes within less than five spacer genes in A. lyrata genome were mapped. When an A. lyrata gene was separately mapped to two syntenic contigs between A. lyrata and A. 2.5. Assembly of A. halleri genes orthologous to A. halleri, the contigs were concatenated following the direction of the lyrata genes based on the available draft A. halleri A. lyrata gene. The A. lyrata gene was then mapped to the concaten- genome ated contigs. We thus obtained an additional 3,106 A. halleri gene Coding genes of A. halleri were generated on the draft A. halleri ge- sequences with >80% similarity and 80% coverage against A. lyrata nome using long-insert mate paired reads and short-insert paired-end genes. Finally, we succeeded in identifying 25,833 (79.1%) A. halleri reads in Illumina DNA sequencing. On the genome, we used anno- genes orthologous to 32,670 A. lyrata genes. Given that all identified tated coding genes. We performed BLAST search between the anno- A. halleri genes were paired with A. lyrata genes, a total of 25,833 tated coding A. halleri and A. lyrata genes by BLASTP. The best pairs were defined as A. lyrata–A. halleri (AL–AH) OGGs. matching pairs were defined to be AL–AH OGGs with more than In the above procedures, we did not identify pseudogene-like A. >80% similarity to and 80% coverage at amino acid level. We suc- halleri genes which are orthologous to A. lyrata genes. Consequently, ceeded in identifying 26,007 (79.6%) A. halleri genes which were duplication frequency may be underestimated in either the A. lyrata– orthologous to 32,670 A. lyrata genes. Out of 26,007 orthologous A. halleri or the A. halleri lineage. Therefore, we also identified pairs of A. halleri and A. lyrata genes, 22,105 (22,105/26,007 ¼ 26,428 orthologous A. halleri regions truncated by terminal codons 85%) A. lyrata genes were inferred in the AL–AH OGGs based on with >80% similarity and 80% coverage against A. lyrate genes. only Illumina paired-end DNA-sequencing reads. The number of spe- These analysis procedures and findings are summarized in cifically identified A. halleri genes based on Illumina paired-end Supplementary Figure S1. DNA sequencing and the available A. helleri genome was 3,728 and 3,902, respectively (Supplementary Fig. S2). Thus, either method had 2.4. Validation of A. halleri lineage-specific duplicated a particular advantage to infer the orthologous genes. genes by droplet digital PCR A. halleri genes duplicated in the A. lyrata lineage were identified To calculate the read coverage of each A. halleri gene, the mapped with the Inparanoid algorithm Given the best-match pair, A1 and count was divided by the number of genes to which a read was B1, an A. halleri gene that had a smaller sequence distance to A1 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 5 (or B1) than the distance between A1 and B1, was added to the AL– (GO) assignments for the A. thaliana genes from The Arabidopsis AH OGG containing A1 and B1 (seeds). This process was continued Information Resource (www.arabidopsis.org). Among the top three until all qualified A. halleri genes were assigned to seed pairs of the GO categories (cellular components, molecular functions, and biologi- genes. cal processes), we analysed only biological processes. For each GO cat- egory in the biological processes category, the expected ratio was inferred to be the ratio between the number of genes assigned to the 2.6. Selection pressures in the A. halleri-lyrata and GO category and the number of genes not assigned to the GO category A. halleri lineages among all annotated A. thaliana genes. In each GO category, the ob- To infer selection pressure in the A. halleri–lyrata lineage, we fo- served ratio in each category of OGGs was compared with the ex- cussed on orthologous groups that followed the speciation process pected ratio by a v test for different categories of OGGs. The ratios among the genes of the five non-Arabidopsis species (B. rapa, were processed in the R software environment (www.r-project.org) B. stricta, C. grandiflora, C. rubella, E. salsugineum), A. thaliana, and normalized among different arrays using Z-scores calculated using A. lyrata and A. halleri. In each orthologous group, a multiple align- genescale in the R library. A heatmap was generated using heatmap.2 ment was made to match the coding regions with the computer pro- in the R library. To correct for multiple testing, the false discovery rate gramme MAFFT v7.215 with default parameters. Using the (FDR) was estimated by the R-library Q-VALUE software. The null phylogenetic tree and multiple alignment, we constructed the com- hypothesis was rejected if FDR values were < 0.01. mon ancestral sequences among A. thaliana, A. halleri and A. lyrata, and the common ancestral sequence between A. halleri and A. lyrata with codeml (model ¼ 1, NSsites ¼ 0) in PAML. For each pair of an- 3. Results and discussion cestral sequences, the synonymous (K ) and non-synonymous substi- tution rates (K ) were calculated using yn00 (code ¼ 0, weighting ¼ 3.1. Duplication frequency in the A. halleri-lyrata 0, commonf3x4 ¼ 0) in PAML. To determine whether the K /K A S lineage ratio was significantly <1, two maximum likelihood values were cal- To examine the frequency of duplication events in the A. halleri-lyrata culated with the K /K ratio fixed at 1 and with the K /K ratio as a A S A S lineage (Fig. 2), we first used 25,833 AL–AH OGGs based on Illumina free parameter. The ratio of the maximum likelihood values was paired-end DNA-sequencing reads. We searched for an orthologous then compared with the v distribution. A P-value representing the A. thaliana gene for each AL–AH OGG by BLASTP searches between deviation of the two models was then calculated for the A. halleri- AL and AH OGG protein sequences and annotated A. thaliana pro- lyrata lineage. tein sequences. When both the A. lyrata and A. halleri genes in an To infer selection pressure in the A. halleri lineage, we generated a AL–AH OGG had the best hit to the same A. thaliana gene, the A. tree file for the A. thaliana, A. lyrata and A. halleri genes. In each of thaliana gene was considered an orthologous candidate gene. Of the orthologous groups, a multiple alignment was made to match the 25,833 AL–AH OGGs, 93.3% (24,019/25,833) had orthologous can- coding regions by the computer programme MAFFT. Using the tree didate genes in A. thaliana. Among these, we searched for orthologous file and multiple alignment file, we constructed the common ancestral groups that were consistent with the species tree. To do this, the syn- sequences among A. thaliana, A. halleri,and A. lyrata with codeml onymous substitution rates (K ) were calculated among the A. thali- (model ¼ 1, NSsites ¼ 0) in PAML. Although we used a representa- ana, A. lyrata,and A. halleri genes in each orthologous group using tive A. halleri gene sequence to infer the ancestral sequence, proper A. yn00 in PAML. When the K between A. lyrata and A. halleri was halleri genes had sequence variation when they were lineage-specific lower than both the K between A. thaliana and A. lyrata and the K S S duplicated after the split from A. lyrata. From the Illumina DNA-se- between A. thaliana and A. halleri, we assumed that the topology of quencing reads mapped to A. halleri genes by BOWTIE2 version the orthologous group was consistent with the species tree. Among the 2.2.3 with default parameters, the sequence variation was examined 24,019 AL–AH OGGs, 22,602 (22,602/24,019 ¼ 94.1%) had the in each A. halleri gene by the Picard software with default parameters same topology as the species tree. In the 22,602 AL–AH OGGs, (http://broadinstitute.github.io/picard/). When a variable sequence 16,704 and 2,370 A. thaliana genes were uniquely and multiply as- was observed in the A. halleri genes, the codon sequence including the signed to AL–AH OGGs, respectively. To examine the duplication fre- variable nucleotide was concatenated into the representative A. halleri quency in A. halleri-lyrata lineage, we considered two evolutionary genes. Alignments between A. halleri genes including codons with scenarios for the 2,370 A. thaliana genes multiply assigned to 5,898 variable nucleotides and the ancestral sequence were modified by add- OGGs. One is that gene duplication events occurred in the A. halleri- ing codons of the ancestral sequence against concatenated codons in- lyrata lineage. That is, the assigned A. thaliana gene is orthologous to cluding variable nucleotides. K and K were calculated in each pair A S the AL–AH OGGs. The other is that gene loss events occurred in the of ancestral and A. halleri gene sequences including variable se- A. thaliana lineage after gene duplication in the common ancestor quences using yn00 in PAML. To determine whether the K /K ratio A S among A. thaliana, A. lyrata and A. halleri. In this case, the assigned was significantly <1, two maximum likelihood values were calculated A. thaliana gene is not orthologous to the AL–AH OGGs. To examine with the K /K ratio fixed at 1 and with the K /K ratio as a free pa- A S A S these two possible scenarios, we collected AL–AH OGGs to which an rameter using codeml in PAML. The ratio of the maximum likelihood A. thaliana gene was assigned as an orthologous gene. We then calcu- values was then compared with the v distribution. A P-value repre- lated the K between the A. thaliana gene and AL–AH OGG, and senting the deviation of the two models was then calculated for each searched for the closest pair of the AL–AH OGG against the A. thali- A. halleri gene. These analysis procedures are summarized in ana gene. When the K between the chosen AL-AH OGG and the Supplementary Figure S3. other AL-AH OGG was lower than the K in the closest pair, the other AL–AH OGG was defined as sharing the A. thaliana gene as an 2.7. Inference of over-represented gene ontologies orthologous gene. Of the 5,898 AL–AH OGGs, 497 had no ortholo- Assuming that A. halleri and A. lyrata genes have similar GO assign- gous A. thaliana genes and 5,401 had 2,370 orthologous A. thaliana ments in A. thaliana in the same OGG, we obtained gene ontology genes. In the following analyses, 22,105 (16,704 þ 5,401) AL–AH Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 6 Functional divergence after gene duplication OGG with 19,074 (16,704 þ 2,370) orthologous A. thaliana genes patens), rice (Oryza sativa) and poplar (Populus trichocarpa). The were defined as A. thaliana–A. lyrata–A. halleri (AT–AL–AH) OGGs gain rates were 1.8–3.5  10 in the three branches, 10 times (Supplementary Table S3). These analysis procedures are summarized lower than in the A. halleri-lyrata lineage (Fig. 2). One explanation in Supplementary Figure S4. for this gain rate is that even though many genes were fixed and re- In A. lyrata-halleri lineage, 22,105 AT–AL–AH OGGs were de- tained, a large number of them did not survive in the long run. This rived from the 19,074 genes, which represent the number of genes in explanation is consistent with the gradual decay of paralog synony- the common ancestor of A. thaliana, A. lyrata and A. halleri. Thus, mous substitution rates observed in several eukaryotes over 54,55 3,031 (22,10519,074) gains through gene duplication were sup- time. posed to have occurred in the A. halleri-lyrata lineage. The gain rate Note that the effect of tandemly duplicated genes on these gain (total gene duplication gains during a given time period divided by rates needs to be considered because tandemly duplicated genes have the estimated duration per gene) was inferred to be 1.8–2.0  10 undergone gene conversion, which leads to identical sequences (3,031 gains/19,074 genes/8–9 MY) in the A. halleri-lyrata lineage. among tandemly duplicated genes in the same species. Additionally, we examined 26,428 AL–AH OGGs including Consequently, our procedure may have failed to identify some pseudogene-like A. halleri genes (Supplementary Table S4), and iden- orthologous A. halleri genes among the A. lyrata tandemly dupli- tified 22,684 AT–AL–AH OGGs derived from the 19,318 ortholo- cated genes. In such cases, our method would have missed the dupli- gous A. thaliana genes. The gain rate was inferred to be 1.9–2.2  cation event in the A. halleri-lyrata lineage, meaning our gain rate is 10 (3,366 gains/19,318 genes/8–9 MY) in the A. halleri-lyrata line- underestimated in the A. lyrata-halleri lineage. Thus, tandemly dupli- age. The gain rate in OGGs including pseudogene-like genes cated genes did not disturb the trend of a higher gain rate in A. hal- (Fig. 4C) is similar to OGGs without pseudogene-like genes, indicat- leri-lyrata in comparison with other land plants. ing that addition of pseudogene-like A. halleri genes did not affect duplication frequency in the A. halleri-lyrata lineage. Second, we used 26,007 AL–AH OGGs based on the available A. 3.2. Duplication frequency in the A. halleri lineage halleri genome. Of 26,007 AL–AH OGGs, 91.0% (23,682/26,007) The copy numbers of the A. halleri genes were unclear from AT– had orthologous candidate genes in A. thaliana (Supplementary AL–AH OGGs. However, the absolute copy number of an A. halleri Table S5). Among the 23,682 AL–AH OGGs, 21,984 (21,984/ gene could be experimentally inferred by the absolute DNA concen- 23,682 ¼ 92.8%) had the same topology as the species tree. In the tration of the gene by ddPCR. First, to examine the relationship be- 21,984 AL–AH OGGs, 16,800 and 2,058 A. thaliana genes were tween copy number and DNA concentration, we focussed on known 41 58 uniquely and multiply assigned to AL–AH OGGs, respectively. Since three-copy genes, HMA4 and MTP1, and 10 singleton genes that 2,058 A. thaliana genes multiply assigned to 5,183 OGGs, 5,183 share a single copy in a broad range of plants. The concentration OGGs were classified into 446 OGGs which had no orthologous of HMA4 was 627.5 6 40.39 and 626.75 6 39.38 copies/ll (four A. thaliana genes and 4,737 OGGs which had 2,058 orthologous replicates for each of the two primer pairs, average of each primer A. thaliana genes. Following the above procedure, 21,537 (16,800 þ pair 6 95% CI). The concentration of MTP1 was 605.0 6 22.39. 4,737) AL–AH OGG with 18,858 (16,800 þ 2,058) orthologous Conversely, the concentration of the 10 singleton genes was 193.30 A. thaliana genes were defined as AT–AL–AH OGGs. The gain rate 6 9.06 copies/ll (four replicates for each of the 10 primer pairs, av- was inferred to be 1.6–1.8  10 (2,679 gains/18,858 genes/8–9 erage of all 10 primer pairs 6 95% CI). The average DNA concen- MY) in the A. halleri-lyrata lineage. Taken together, the gain rate tration was 3.2 times higher for the three-copy genes compared without pseudogene-like A. halleri genes was 1.6–2.0  10 in the with the single-copy genes (Fig. 3A), indicating that the copy number A. halleri-lyrata lineage (Fig. 2). corresponded to the DNA concentration inferred by ddPCR. In our previous study, we inferred the gain rate of gene duplica- The copy numbers of the A. halleri genes could also be inferred by tion in the lineage leading to Arabidopsis after the divergence of the reading depth of the Illumina DNA-sequencing reads. The read- moss. Using the same method, we re-estimated the gain rates in ing depth was defined as the RPKM. To examine whether RPKM three times periods—after the divergence of moss (Physcomitrella values corresponded with copy numbers, we focussed on 25 AB y = 0.7991x 500 1200 0 500 1000 1500 2000 2500 3-copied genes Single copied genes RPKM Figure 3. Relationship between the Illumina DNA-sequencing read depth and the copy number inferred by ddPCR. (A) The Y-axis represents copy numbers per ll inferred by ddPCR. Black circles (gray background) and open circles (black background) indicate three-copy genes and single-copy genes, respectively. All points and error bars represent averages of four replicates and 95% CIs. (B) Each dot represents an A. halleri gene. The X-axis represents the Illumina DNA-sequencing read depth, which is the number of reads per 1 Kbp per 1 million reads. The Y-axis represents copy numbers per ll, which were inferred by ddPCR. The regression line was calculated with the simple formula Y ¼ aX; a was inferred by the least squares method. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Concentrations (Copies/µl) Concentrations (Copies/µl) K. Hanada et al. 7 A. halleri genes that had a wide variety of RPKMs (see section 2). genes, we fosused on 1,159 AT–AL–AH OGGs composing only For these genes, the copy number estimated by digital droplet PCR pseudogene-like A. halleri genes among 22,684 AT–AL–AH OGGs (ddPCR) was compared with the RPKM. Consequently, we found including pseudogene-like genes. We then identified 713 pseudogene- that RPKM was significantly correlated with concentration (Fig. 3B, like A. halleri genes experiencing gene duplications in either the R ¼ 0.94). This result indicated that RPKM values could be used an A. halleri-lyrata or the A. halleri lineage. In the case of 22,105 index of copy numbers for A. halleri genes. AT–AL–AH OGGs without any pseudogene-like A. halleri genes, we To infer duplication frequencies by RPKM values, we focussed on identified 6,752 A. halleri genes experiencing gene duplications in ei- 285 AT–AL–AH OGGs with single-copy genes in a broad range of ther the A. halleri-lyrata or the A. halleri lineage. The proportion plant species such as Arabidopsis, Carica, Populus, Vitis, Oryza, (713/1,159 ¼ 62%) of duplicated genes in pseudogene-like A. halleri Selaginella and Physcomitrella. The RPKM values of the A. halleri genes was significantly higher than that (6,752/22,105 ¼ 31%) of genes showed a normal distribution (Supplementary Fig. S5A). To duplicated genes in the other A. halleri genes (P-value ¼ 1.9 107 2 call duplicated genes, the top 5% (309) of RPKMs was defined as the 10 by v test, Table 1), indicating that most of pseudogenes threshold for duplicated genes. That is, A. halleri genes with an tended to appear via gene duplication in Arabidopsis. Taken to- RPKM < 309 were defined as non-duplicated genes, A. halleri genes gether, it is likely that most of recently duplicated genes in with RPKM values from 309 to 618 (309  2) were defined as two- Arabidopsis may be on the process of decayed genes but some of du- copy genes, and A. halleri genes with RPKM values of 618–927 (309 plicated genes significantly contributed to functional divergence 3) were defined as three-copy genes. Following this rule, in 22,105 among Arabidopsis species. AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing Using 21,537 AT–AL–AH OGGs based on the available A. halleri reads, we identified 2,488 multiply duplicated A. halleri genes and genome, we identified 1,248 gain events in 838 genes 3,378 gain events through gene duplications in the A. halleri lineage (Supplementary Table S5). The gain rate was 2.9  10 (1,248 (Supplementary Table S3). The gain rate (total gains from gene du- gains/21,537/2MY). The gain rate was approximately half in com- plication during a given time period divided by the estimated dura- parison with the gain rate (5.8–7.6  10 ) inferred by RPKMs. tion per gene) was 7.6  10 (3,378 gains/22,105 genes/2 MY). Therefore, copy number information inferred by ddPCR was com- However, the gain events may have been over-estimated because pared with the number of A. halleri lineage-specific duplicated genes. duplication events occurring in the A. halleri-lyrata lineage caused an Most of OGGs had single-copy genes in A. halleri although various increase of RPKMs. Therefore, excluding A. halleri genes that were duplication frequencies were inferred by ddPCR (Supplemental duplicated in the A. halleri-lyrata lineage, we counted the number of Fig. S5B). These results indicate that some of gene duplication tended gain events in the A. halleri lineage. Among the currently retained A. to be missed in a genome assembly based only on Illumina short halleri genes, we focussed on 16,946 A. halleri genes that had not un- reads. This shows that the reading depth of raw Illumina reads may dergone gene duplication in the A.halleri-lyrata lineage. These be informative for inferring the number of recently duplicated genes. 16,946 genes had undergone 1,958 gain events through gene duplication in the A. halleri lineage. The re-estimated gain rate was 3.3. Selection pressures in the A. halleri-lyrata lineage 5.8  10 (1,958 gains/16,946 genes/2 MY). The gain rate (5.8–7.6 10 ) of duplicate genes in the A. halleri lineage was approxi- We found that 23–30% of OGGs had undergone gene duplications mately four times higher than the gain rate (1.6–2.0  10 ) in the in the Arabidopsis lineage at high gain rates (1.6–7.6  10 gains/ A. halleri-lyrata lineage (Fig. 2). As mentioned earlier, the gain rate gene/MY), which were 10 times higher than the rates inferred by difference can be explained by a rapid decay of duplicated genes. comparative genomics among Arabidopsis, poplar, rice and moss However, this gain rate did not include gene duplications derived (Fig. 2). Therefore, we were interested in investigating the functional from pseudogenes which were on the earlier process of decayed divergence of duplicated genes in Arabidopsis. To infer the func- genes. To determine the decayed effect in the A. halleri lineage, we tional divergence of duplicated genes in the A. halleri-lyrata lineage, focussed on 22,684 AT–AL–AH OGGs including pseudogene-like we tried to infer the ancestral sequences of the most recent common genes (Supplementary Table S4). Out of 22,684 OGGs, the gain rate ancestor of A. thaliana, A. lyrata and A. halleri. To define the node in the A. halleri lineage was 1.2  10 (5,665 gains/22,684 genes/2 of the most recent common ancestor among A. thaliana, A. lyrata, MY). Thus, the gain rate including pseudogenes-like A. halleri genes and A. halleri, we used an orthologous non-Arabidopsis gene as an was approximately two times higher than the gain rate (7.6  10 ) outgroup sequence for each of AT–AL–AH OGGs and performed without any pseudogene-like A. halleri genes. This result indicates BLASTP searches of AT–AL–AH protein sequences against all five that pseudogene-like A. halleri genes accelerated the gain rates in the non-Arabidopsis (B. rapa, B. stricta, C. grandiflora, C. rubella, A. halleri lineage. To determine whether pseudogenes tended to have E. salsugineum) protein sequences. When the best-hit non- a higher duplication rate in comparison with non-pseudogene-like Arabidopsis gene was the same for the A. thaliana, A. lyrata and Table 1. Comparison of gene duplication to non-gene duplication ratio in either A. halleri-lyrata or A. halleri lineage between OGGs including pseudogene-like A. halleri genes and OGGs without any pseudogene-like A. halleri genes Gene duplication No gene duplication D/N ratio P value (v test) in either A. halleri- in either A. halleri- lyrata or A. halleri lyrata or A. halleri lineage (D) lineage (N) OGGs derived from only pseudogenes 713 446 1.60 1.9  10 OGGs without any pseudogenes 6,752 15,353 0.44 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 8 Functional divergence after gene duplication A. halleri genes, the non-Arabidopsis gene was considered a candi- lineage, the proportions of positive selection and purifying selection date orthologous gene to the AT–AL–AH OGG. For each of candi- in 2,488 duplicated genes were compared with those in 19,617 date genes, we searched for candidate orthologous groups that were non-duplicated genes. In this test, the null model was the ratio of du- consistent with the species tree. To do this, we generated a phyloge- plicated genes to non-duplicated genes in the A. halleri lineage. As netic tree by the neighbour-joining method using the PAUP software observed in the A. halleri-lyrata lineage, the proportions of positive 60,61 (set outroot ¼ mono, dset distance ¼ hky). When the topology selection in the duplicated genes were significantly higher than those of the gene tree was the same as that of the species tree, we assumed in the non-duplicated genes (P-value ¼ 2.3  10 for positive selec- that the topology of the orthologous group was consistent with the tion by v test, Fig. 4C), while the proportion of purifying selection species tree. There were 22,105 and 21,537 AT–AL–AH OGGs in the non-duplicated genes was significantly higher than in the du- 26 2 based on Illumina paired-end DNA-sequencing reads and the avail- plicated genes (P-value ¼ 1.9  10 by v test, Fig. 4C). We did able A. halleri genomes, respectively. Out of 22,105 and 21,537 the same analysis in 21,537 AT–AL–AH OOGs based on the avail- AL–AH OGGs, we identified 17,669 and 17,925 orthologous groups able A. halleri genome (Supplementary Table S5). We found the that followed the speciation process among non-Arabidopsis, same trend in the OOGs (P-value ¼ 9.9  10 for positive selec- 18 2 A. thaliana, A. lyrata, and A. halleri genes, respectively tion, P-value ¼ 2.2  10 for purifying selection by v test, (Supplementary Table S3). These OGGs were defined as non-A-AT– Fig. 4D). These results indicate that gene duplication induced func- AL–AH OGGs. These analysis procedures are summarized in tional divergence in the A. halleri lineage as well. Supplementary Figure S6. Functional divergence may not occur immediately after gene du- Based on the phylogenetic tree in each non-A-AT–AL–AH OGG, plication. To determine whether gene duplication in the A. halleri- we inferred the ancestor sequences of all nodes using codeml in the lyrata lineage affected selection pressures in the A. halleri lineage, we PAML package, and calculated K and K in the A. halleri-lyrata classified the 22,105 AT–AL–AH OGGs based on Illumina paired- A S lineage. First, among the 17,669 OGGs base on Illumina paired-end end DNA-sequencing reads into 5,159 OGGs with gene duplications DNA-sequencing reads, we found 481, 8,313 and 8,875 genes with and 16,946 OGGs without any gene duplications, focusing on the positive selection (K /K > 1), purifying selection (K /K < 1) and A. halleri-lyrata lineage. The proportions of positive selection and A S A S unclear selection respectively, by the maximum likelihood approach purifying selection in the duplicated genes in the A. halleri lineage (Supplementary Table S3). To address whether the duplicated genes were compared with those in the non-duplicated genes. In this test, contributed to functional divergence in the A. halleri-lyrata lineage, the null model was the ratio of duplicated genes to non-duplicated the proportions of positive selection and purifying selection in 1,782 genes in the A. halleri-lyrata lineage. The proportion of positive se- duplicated genes were compared with those in 15,887 non- lection in the duplicated genes was significantly higher than those in duplicated genes. In this test, the null model was the ratio of dupli- the non-duplicated genes (P-value ¼ 1.3  10 for positive selec- cated genes to non-duplicated genes without any particular selection tion by v test, Fig. 4E). The proportion of purifying selection in the pressure in the A. halleri-lyrata lineage. The proportions of positive non-duplicated genes was significantly higher than in the duplicated 85 2 selection in the duplicated genes was significantly higher than those genes (P-value ¼ 1.9  10 by v test, Fig. 4E). We did the same 57 2 in the non-duplicated genes (P-value ¼ 2.8  10 by v test, analysis in 21,537 based on the available A. halleri genome. We Fig. 4A). The proportion of purifying selection in the non-duplicated again found the same trend (P-value ¼ 4.8  10 for positive selec- 39 2 genes was significantly higher than in the duplicated genes in the tion, P-value ¼ 1.1  10 for purifying selection by v test, 33 2 A. halleri-lyrata lineage (P-value ¼ 1.2  10 by v test, Fig. 4A). Fig. 4F). These results indicate that gene duplication in the A. halleri- Furthermore, we did the same analysis in 17,925 OOGs based on lyrata lineage contributed to functional divergence in the A. halleri the available A. halleri genome (Supplementary Table S5). We found lineage. the same trend for only positive selection in the OOGs (P-value ¼ To determine whether gene duplications in the A. halleri-lyrata or 3.0  10 for positive selection, P-value ¼ 0.05 for purifying selec- A. halleri lineage contributed to functional divergence in the A. hal- tion by v test, Fig. 4B). These results indicate that gene duplication leri lineage, we focussed on two categories of OGGs—genes not du- induced functional divergence in the A. halleri-lyrata lineage. plicated in the A. halleri-lyrata lineage but duplicated in the A. halleri lineage (1,593 OGGs), and genes duplicated in the A. halleri- lyrata lineage but not duplicated in the A. halleri lineage (4,264 3.4. Selection pressure in the A. halleri lineage OGGs). The proportions of positive selection and purifying selection To infer selection pressure in the A. halleri lineage, we generated an- in the A. halleri lineage were compared in the two categories. In this cestral sequences of the A. lyrata and A. halleri genes using A. thali- test, the null model was the ratio of the two categories of OGGs ana genes as outgroups. When an A. halleri gene did not have any (1,593 and 4,264 OGGs). The proportions of positive selection in sequence variation in the Illumina DNA-sequencing reads, K and genes duplicated only in the A. halleri-lyrata lineage were signifi- K were simply calculated by comparing the ancestral sequence to cantly higher than in genes duplicated only in the A. halleri lineage 9 2 the representative A. halleri gene sequence. When sequence variation (P-value ¼ 7.3  10 for positive selection by v test, Fig. 4G). for a representative A. halleri gene sequence was identified from the Conversely, the proportion of purifying selection in genes duplicated Illumina DNA-sequencing reads, K and K were separately calcu- only in the A. halleri-lyrata lineage was significantly lower than in S A lated for the varied sequences (see Materials and Methods, genes duplicated only in the A. halleri lineage (P-value ¼ 2.9  10 Supplementary Fig. S4). Among the 22,105 AT–AL–AH OGGs by v test, Fig. 4G). We did the same analysis in 21,537 based on based on Illumina paired-end DNA-sequencing reads, we found based on the available A. halleri genome. We found the same trend 568, 7,717, and 13,820 genes with positive selection (K /K > 1), (P-value ¼ 0.04 for positive selection, P-value ¼ 0.02 for purifying A S purifying selection (K /K < 1) and unclear selection, respectively, in selection in the available genome by v test, Fig. 4H). These results A S the A. halleri lineage by the maximum likelihood approach indicate that gene duplication in the A. halleri-lyrata lineage was the (Supplementary Table S3). To examine the relationship between the main determinant of the elevated proportion of positive selection in effect of gene duplication and selection pressures in the A. halleri the A. halleri lineage. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 9 AB Gene duplication in A. halleri-lyrata lineage 80% Gene duplication in A. halleri-lyrata lineage 100% Non-gene duplication in A. halleri-lyrata lineage Non-gene duplication in A. halleri-lyrata lineage 80% 60% 60% 40% 40% 20% 20% 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri-lyrata lineage A. halleri-lyrata lineage CD Gene duplication in A. halleri lineage Gene duplication in A. halleri lineage 80% 80% Non-gene duplication in A. halleri lineage Non-gene duplication in A. halleri lineage 60% 60% 40% * 40% 20% 20% 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri lineage A. halleri lineage EF 80% Gene duplication in A. halleri-lyrata lineage Gene duplication in A. halleri-lyrata lineage 80% Non-gene duplication in A. halleri-lyrata lineage Non-gene duplication in A. halleri-lyrata lineage 60% 60% 40% 40% * 20% 20% * * 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri lineage A. halleri lineage Gene duplication in A. halleri-lyrata lineage Gene duplication in A. halleri-lyrata lineage GH but non-gene duplication in A. halleri lineage but non-gene duplication in A. halleri lineage Non-gene duplication in A. halleri-lyrate lineage Non-gene duplication in A. halleri-lyrate lineage 80% 80% but gene duplication in A. halleri lineage but gene duplication in A. halleri lineage 60% 60% 40% 40% 20% 20% * * 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri lineage A. halleri lineage Figure 4. Gene duplication and selection pressure in the A. halleri-lyrata and A. halleri lineages. Genes were classified as being under positive selection (K /K > 1), unclear selection or purifying selection (K /K < 1) in the A. halleri-lyrata and A. halleri lineages. Asterisks (*) represent significant differences by A S A S the chi-square test (P < 0.05). (A) The relationship between gene duplication and selection pressure in the A. lyrata-halleri lineage in 17,669 OGGs base on Illumina paired-end DNA-sequencing reads. (B) The relationship between gene duplication and selection pressure in the A. lyrata-halleri lineage in 17,925 OOGs based on the available A. halleri genome. (C) The relationship between gene duplication and selection pressure in the A. halleri lineage in 22,105 AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads. (D) The relationship between gene duplication and selection pressure in the A. halleri lineage in 21,537 AT–AL–AH OOGs based on the available A. halleri genome. (E) The relationship between gene duplication in the A. lyrata-halleri lineage and selection pressure in the A. halleri lineage in 22,105 AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads. (F) The relationship between gene duplication in the A. lyrata-halleri lineage and selection pressure in the A. halleri lineage in 21,537 AT–AL–AH OOGs based on the available A. halleri genome. (G) Comparison of selection pressure in the A. halleri lineage between gene duplication in only the A. lyrata-halleri lineage and gene duplication in only the A. halleri lineage in OGGs base on Illumina paired-end DNA-sequencing reads. (H) Comparison of selection pressure in the A. halleri lineage between gene du- plication in only the A. lyrata-halleri lineage and gene duplication in only the A. halleri lineage in OGGs based on the available A. halleri genome. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 10 Functional divergence after gene duplication GO categories assigned to the A. thaliana genes belonging to the OGGs, 3.5. Functional bias of genes under positive or over-represented GO categories were identified (Supplementary Table purifying selection in the A. halleri-lyrata and A. halleri S6;see Section 2,FDR < 0.01). lineages When we focussed on metal and zinc tolerance or accumulation We found that gene duplication contributed to functional divergence among A. thaliana, A. lyrata and A. halleri, both A. lyrata and A. in comparison with non-duplicated genes in Arabidopsis but many halleri had a higher tolerance to metal and zinc than A. thaliana. In duplicated genes had been retained with purifying selection, which in- particular, both metal and zinc tend to be accumulated in A. halleri duces an increase of gene dosage. Thus, we were interested in investi- 62,63 compared with A. lyrata. This observation suggests that metal gating the functional bias in duplicated and non-duplicated genes tolerance is enhanced in the A. halleri–A. lyrata lineage, and that with positive/purifying selection in the A. halleri-lyrata and A. halleri metal accumulation is enhanced in the A. halleri lineage. Genes asso- lineages. OGGs based on Illumina paired-end DNA-sequencing reads ciated with metal and zinc transporters/responses were highly dupli- were classified into duplicated and non-duplicated genes in the A. hal- cated with purifying selection in the A. halleri-lyrata lineage, leri-lyrata and A. halleri lineages. The OGGs were then further classi- indicating that the dosages of genes associated with metal responses fied by positive and purifying selection. This gave a total of eight and transporters had been enhanced in the A. halleri-lyrata lineage. classes. In each class, we examined the degree of over-representation Furthermore, genes associated with metal and zinc responses were in 1,504 GO categories compared with the number of GO categories highly duplicated with purifying selection in the A. halleri-lyrata line- assigned to A. thaliana genes (see Section 2) (Fig. 5). Although two age, indicating that the dosages of genes associated with metal and classes with no duplication and purifying selection were clustered into zinc responses had been enhanced in the A. halleri-lyrata lineage. one group, the other six classes were not essentially clustered with Thus, tolerance or accumulation of metal and zinc was enhanced in each other. These results indicate that the evolutionary direction of the A. halleri-lyrata and A. halleri lineages through gene duplication. A. halleri was quite different in the A. halleri-lyrata and A. halleri lin- Genes associated with the reproductive system, cell cycle, various de- eages with respect to either gene dosage by gene duplication or func- velopments, various metabolites, epigenetics, metabolites, abiotic and tional divergence by positive selection. biotic responses were subject to positive selection in either the A. hal- To examine the kinds of genes associated with phenotypic differences leri-lyrata or A. halleri lineage (Supplementary Table S6). However, we among A. thaliana, A. lyrata and A. halleri, we identified significantly do not have any clear idea why such genes were subject to positive over-represented GO categories in OGGs with gene duplication and pu- selection. Additionally, genes associated with various ubiquitous pro- rifying selection, OGGs with gene duplication and positive selection and cesses tended to be duplicated in either the A. halleri-lyrata or the A. OGGs with non-duplication and positive selection in the A. halleri–A. halleri lineage with purifying selection (Supplementary Table S6). These lyrata and A. halleri lineages. OGGs with non-duplication and purifying over-represented functional categories may contribute to phenotypic selection were disregarded in the analysis because such genes tended to differences between A. thaliana and either A. lyrata or A. halleri have the same functions in A. halleri, A. lyrata and A. halleri.From the Color Key −2 −1 0 1 2 Figure 5. Over-represented functional categories. The X-axis represents different kinds of OGGs. OGGs were classified into duplicated genes and non-dupli- cated genes in the A. halleri-lyrata (AHL) and A. halleri (AH) lineages. The OGGs were then further classified by positive or purifying selection. The Y-axis repre- sents 1,504 GO categories (biological processes) assigned to A. thaliana genes belonging to the OGGs. The key shows the relationship between colour and the z-score of over-representation in the GO categories. Red and green indicate low and high over-representation, respectively. The ratio of observed gene numbers in selected OGGs to expected gene numbers inferred from the data for all annotated genes was calculated in each GO category. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Duplication with positive selection in AH lineage Non-Duplication with purifying selection in AHL lineage Non-Duplication with purifying selection in AH lineage Non-Duplication with positive selection in AHL lineage Non-Duplication with positive selection in AH lineage Duplication with positive selection in AHL lineage Duplication with purifying selection in AH lineage Duplication with purifying selection in AHL lineage K. Hanada et al. 11 through high dosages. However, we do not know the phenotypic differ- result indicates that gene duplication tends to enhance functional di- ences associated with these functional categories. These duplicated vergence in comparison with non-duplicated genes in the genes might have been retained because increased gene dosages associ- Arabidopsis lineage. In contrast, the general observation is that du- ated with these functional categories were not too disadvantageous for plicated genes tend to have less functional divergence in yeasts, 24,68,69 A. halleri. To avoid gaining novel functions, these genes may be under plants and mammals. This is because functionally important purifying selection. In the future, these duplicated genes may be lost if genes are more likely to be retained as duplicates. This contradic- disadvantageous functions appear. Indeed, the A. halleri-lyrata and A. tory relationship might derive from the duplication ages. Most previ- halleri lineages have extraordinarily high retention rates of duplicated ous analyses have examined recently observed selection pressures in genes in comparison with earlier plant lineages. These observations in- anciently duplicated genes. When functional divergence was exam- dicate that most of the duplicated genes in the A. halleri-lyrata and ined in recently duplicated genes, the duplicated genes tended to have A. halleri lineages may be lost in future evolution. higher functional divergence than singletons. Together, these re- sults suggest duplicated genes tend to have higher functional diver- gence immediately after duplication than singletons. 3.6. Concluding remarks How long gene duplication accelerates functional divergence re- In this analysis, we generated 25,833 A. halleri genes that were mains an open question. To address this, we examined whether gene orthologous to 79.1% of the annotated A. lyrata genes from contigs duplication in the A. halleri-lyrata lineage (2–10 MYA) accelerated generated from only paired-end reads of Illumina DNA-sequencing. functional divergence in the A. halleri lineage (<2 MYA). On the other hands, we inferred 26,007 AH-AL OOGs based on the Interestingly, we found that gene duplication in the A. halleri-lyrata available A. halleri genome. Out of 32,670 A. lyrata genes, 79.6% lineage enhanced the proportion of positive selection in the A. halleri were identified as orthologous genes to A. halleri genes. Thus, our lineage (Fig. 4E–H). This result indicated that the functional diver- method inferring orthologous genes is compatible to a method to in- gence of duplicated genes was accelerated several MY after gene du- fer orthologous genes based on the available genome. However, it plication. If gene duplication is too deleterious for a gene, the gene has significant limitations for examining gene loss, exon shuffling tends to be lost immediately after duplication. If not, duplicated and heterozygosity because it does not infer any new genes in the ge- genes may be retained for a long period without functional diver- nomes. Nevertheless, the number of duplicated genes inferred by gence because functional divergence may be evolutionarily disadvan- reading depth of Illumina DNA-sequencing reads is likely to be more tageous. Therefore, immediately after duplication, most duplicated reliable in comparison with duplicated genes identified on scaffolds genes might be under functional constraints in comparison with generated by Illumina DNA-sequencing reads (Supplementary genes duplicated several MY earlier. Indeed, many recently dupli- 17,18 Fig. S5B). Therefore, this procedure is useful for inferring lineage- cated genes have functional redundancy in A. thaliana and in specific duplicated genes resulting from such events as SSDs. mammals. These young duplicated genes tend to be less function- The gain rates based on 25, 833 AL–AH OGGs based on both ally constrained than singletons, and may have the potential to ob- Illumina paired-end DNA-sequencing reads and the available A. hal- tain an essential function to survive in new environments. leri genome were 1.6–2.0  10 per gene per MY in the A. halleri- Finally, we examined the kinds of genes that were duplicated lyrata lineage (Fig. 2). Furthermore, using the only mapping coverage and/or under positive selection in the A. halleri-lyrata and A. halleri of the Illumina DNA-sequencing reads, the gain rate was inferred to lineages. Different functional categories tended to have experienced be 5.7–7.6  10 in the A. halleri lineage because gene duplication gene duplication and/or selection pressure in the A. halleri-lyrata and tended to be missed in a genome assembly based on Illumina short A. halleri lineages. For example, A. halleri is known as a heavy metal reads. That is, the gain rates were inferred to be 1.6–2.0 and 5.7–7.6 hyper-accumulator with high metal tolerance. A. lyrata is tolerant of 10 per gene per MY in the A. halleri-lyrata and A. halleri line- heavy-metal ions in the soil to some degree but A. thaliana is not. ages, respectively. The gain rate in the A. halleri lineage was approxi- Genes related to heavy-metal tolerance and accumulation tended to mately four times higher than in the A. halleri-lyrata lineage. Using be highly duplicated with purifying selection in the A. halleri-lyrata our previous data, we re-estimated the gain rates in the three time pe- and A. halleri lineages. Earlier studies reported that metal tolerance riods after the divergence of mosses, rice and poplar (Fig. 2). The in- was enhanced by increasing gene dosage through gene duplication. ferred gain rates (1.8–3.0  10 ) were 10 times lower than in the Our results supported this trend at a genomic scale. Taken together, Arabidopsis lineage (Fig. 2). Thus, gain rates tend to increase as the the results of our study reveal that lineage-specific duplicated genes evolutionary period gets younger. One explanation for this gain rate have contributed to species-specific evolution in Arabidopsis. difference is that duplicated genes tend to rapidly decay over time. This explanation is supported by a higher rate of pseudogenization in recently duplicated genes in comparison with non-duplicated 4. Availability genes (Table 1). Also, several previous reports showed that younger Illumina DNA-sequencing data (DRA004564) have been deposited in duplicated genes tended to be relaxed compared with older dupli- the DDBJ Sequence Read Archive (https://trace.ddbj.nig.ac.jp/dra/). 54,55,64–67 cated genes. That is, most anciently duplicated genes tend Contig sequences (BFAE01000001-BFAE01344622) assembled from not to be retained in current species. Consequently, the gain rates in- the Illumina DNA-sequencing data have been deposited in the DDBJ ferred in earlier evolutional periods tend to decrease. Mass Submission System. A. halleri gene sequences determined in this To investigate the functional divergence of duplicated genes in the study are included in Supplementary Tables S3 and S4. A. halleri-lyrata and A. halleri lineages, we identified OGGs under ei- ther positive or purifying selection in these lineages based on the ra- tio of nonsynonymous and synonymous substitution rates (K /K ). A S Acknowledgements Interestingly, the proportions of positive and purifying selection tended to increase and decrease, respectively, when gene duplication We thank Kiyomi Imamura, Makiko Tosaka, Taiji Kikuchi, Terumi Horiuchi, occurred in either the A. halleri-lyrata or A. halleri lineage. This Kanako Onizuka and Miu Kubota for Illumina DNA-sequencing analyses and Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 12 Functional divergence after gene duplication ddPCR analyses. We also thank the National Institute of Genetics of the 17. Hanada, K., Kuromori, T., Myouga, F., Toyoda, T., Li, W. H. and Research Organization of Information and Systems for providing excellent su- Shinozaki, K. 2009, Evolutionary persistence of functional compensation percomputer services. This work was supported by the Research for by duplicate genes in Arabidopsis, Genome Biol. Evol., 1, 409–14. Evolutional Science and Technology (CREST) programme ‘Creation of essen- 18. Hanada, K., Sawada, Y., Kuromori, T., et al. 2011, Functional compensa- tial technologies to utilize carbon dioxide as a resource through the enhance- tion of primary and secondary metabolites by duplicate genes in ment of plant productivity and the exploitation of plant products’ of the Japan Arabidopsis thaliana, Mol. Biol. Evol., 28, 377–82. Science and Technology Agency (JST) (JPMJCR11B3 to K.H. and S.I.M.); 19. Nowak, M. A., Boerlijst, M. C., Cooke, J. and Smith, J. M. 1997, Grants-in-Aid for Scientific Research (15K14421, 15H02433, K.H. and Evolution of genetic redundancy, Nature, 388, 167–71. S.I.M.) and research grants from the Takeda Science Foundation, the 20. Hanada, K., Kuromori, T., Myouga, F., Toyoda, T. and Shinozaki, K. Sumitomo Foundation, the Kurume Research Park and the Asahi Glass 2009, Increased expression and protein divergence in duplicate genes is as- Foundation (to K.H.). sociated with morphological diversification, PLoS Genet., 5, e1000781. 21. Alvarez-Ponce, D. and Fares, M. A. 2012, Evolutionary rate and duplic- ability in the Arabidopsis thaliana protein-protein interaction network, Conflict of interest Genome Biol. Evol., 4, 1263–74. 22. Warren, A. S., Anandakrishnan, R. and Zhang, L. 2010, Functional bias in None declared. molecular evolution rate of Arabidopsis thaliana, BMC Evol. Biol., 10,125. 23. Wang, J., Marowsky, N. C. and Fan, C. 2013, Divergent evolutionary and expression patterns between lineage specific new duplicate genes and Supplementary data their parental paralogs in Arabidopsis thaliana, PLoS One, 8, e72362. Supplementary data are available at DNARES online. 24. Yang, L. and Gaut, B. S. 2011, Factors that contribute to variation in evo- lutionary rate among Arabidopsis genes, Mol. Biol. Evol., 28, 2359–69. 25. Wendel, J. F., Jackson, S. A., Meyers, B. C. and Wing, R. A. 2016, References Evolution of plant genome architecture, Genome Biol. , 17, 37. 26. DeBarry, J. D. and Kissinger, J. C. 2014, A survey of innovation through 1. Lockton, S. and Gaut, B. S. 2005, Plant conserved non-coding sequences duplication in the reduced genomes of twelve parasites, PLoS One, 9, and paralogue evolution, Trends Genet., 21, 60–5. e99213. 2. Vanneste, K., Baele, G., Maere, S. and Van de Peer, Y. 2014, Analysis of 27. Sin, K., Street, N., Lundeberg, J. and Arvestad, L. 2012, Improved gap 41 plant genomes supports a wave of successful genome duplications in size estimation for scaffolding algorithms, Bioinformatics, 28, 2215–22. association with the Cretaceous-Paleogene boundary, Genome Res., 24, 28. Arabidopsis_Genome_Initiative 2000, Analysis of the genome sequence of 1334–47. the flowering plant Arabidopsis thaliana, Nature, 408, 796–815. 3. Hanada, K., Vallejo, V., Nobuta, K., et al. 2009, The functional role of 29. Hu, T. T., Pattyn, P., Bakker, E. G., et al. 2011, The Arabidopsis lyrata pack-MULEs in rice inferred from purifying selection and expression pro- genome sequence and the basis of rapid genome size change, Nature file, Plant Cell, 21, 25–38. Genetics, 43, 476–81. 4. Hanada, K., Zou, C., Lehti-Shiu, M. D., Shinozaki, K. and Shiu, S. H. 30. Beilstein, M. A., Nagalingum, N. S., Clements, M. D., Manchester, S. R. 2008, Importance of lineage-specific expansion of plant tandem duplicates and Mathews, S. 2010, Dated molecular phylogenies indicate a Miocene in the adaptive response to environmental stimuli, Plant Physiol., 148, origin for Arabidopsis thaliana, Proc. Natl. Acad. Sci. U. S. A., 107, 993–1003. 18724–8. 5. Fortna, A., Kim, Y., MacLaren, E., et al. 2004, Lineage-specific gene du- 31. Moghe, G. D., Hufnagel, D. E., Tang, H., et al. 2014, Consequences of plication and loss in human and great ape evolution, PLoS Biol., 2, E207. whole-genome triplication as revealed by comparative genomic analyses 6. Rostoks, N., Borevitz, J. O., Hedley, P. E., et al. 2005, Single-feature poly- of the wild radish raphanus raphanistrum and three other Brassicaceae morphism discovery in the barley transcriptome, Genome Biol., 6,R54. species, Plant Cell, 26, 1925–37. 7. Clark, R. M., Schweikert, G., Toomajian, C., et al. 2007, Common se- 32. Koenig, D. and Weigel, D. 2015, Beyond the thale: comparative genomics quence polymorphisms shaping genetic diversity in Arabidopsis thaliana, and genetics of Arabidopsis relatives, Nat. Rev. Genet., 16, 285–98. Science, 317, 338–42. 33. Kramer, U. 2010, Metal hyperaccumulation in plants, Ann. Rev. Plant 8. Rizzon, C., Ponger, L. and Gaut, B. S. 2006, Striking similarities in the ge- Biol., 61, 517–34. nomic distribution of tandemly arrayed genes in Arabidopsis and rice, 34. Verbruggen, N., Hermans, C. and Schat, H. 2009, Molecular mechanisms PLoS Comput Biol., 2, e115. of metal hyperaccumulation in plants, New Phytol., 181, 759–76. 9. Rodgers-Melnick, E., Mane, S. P., Dharmawardhana, P., et al. 2012, 35. Shimizu, K. K. and Purugganan, M. D. 2005, Evolutionary and ecological Contrasting patterns of evolution following whole genome versus tandem genomics of Arabidopsis, Plant Physiol., 138, 578–84. duplication events in Populus, Genome Res., 22, 95–105. 36. Turner, T. L., Bourne, E. C., Von Wettberg, E. J., Hu, T. T. and Nuzhdin, 10. Leister, D. 2004, Tandem and segmental gene duplication and recombina- S. V. 2010, Population resequencing reveals local adaptation of tion in the evolution of plant disease resistance gene, Trends Genet., 20, Arabidopsis lyrata to serpentine soils, Nat. Genet., 42, 260–3. 116–22. 37. Novikova, P. Y., Hohmann, N., Nizhynska, V., et al. 2016, Sequencing of 11. Ohno, S. 1970, Evolution by Gene Duplication. Springer-Verlag: New the genus Arabidopsis identifies a complex history of nonbifurcating spe- York. ciation and abundant trans-specific polymorphism, Nat. Genet, 48, 12. Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L. and 1077–82. Postlethwait, J. 1999, Preservation of duplicate genes by complementary, 38. Briskine, R. V., Paape, T., Shimizu-Inatsugi, R., et al. 2017, Genome as- degenerative mutations, Genetics, 151, 1531–45. sembly and annotation of Arabidopsis halleri, a model for heavy metal 13. Zou, C., Lehti-Shiu, M. D., Thibaud-Nissen, F., Prakash, T., Buell, C. R. hyperaccumulation and evolutionary ecology, Mol. Ecol. Resour., 17, and Shiu, S. H. 2009, Evolutionary and expression signatures of pseudo- 1025–36. genes in Arabidopsis and rice, Plant Physiol., 151, 3–15. 39. Akama, S., Shimizu-Inatsugi, R., Shimizu, K. K. and Sese, J. 2014, 14. Zheng, D. and Gerstein, M. B. 2007, The ambiguous boundary between Genome-wide quantification of homeolog expression ratio revealed non- genes and pseudogenes: the dead rise up, or do they? Trends Genet., 23, stochastic gene regulation in synthetic allopolyploid Arabidopsis, Nucleic 219–24. Acids Res., 42, e46. 15. Conant, G. C. and Wolfe, K. H. 2008, Turning a hobby into a job: how 40. Sato, Y. and Kudoh, H. 2014, Fine-scale genetic differentiation of a tem- duplicated genes find new functions, Nat. Rev. Genet., 9, 938–50. perate herb: relevance of local environments and demographic change, 16. Kondrashov, F. A. 2012, Gene duplication as a mechanism of genomic ad- AoB Plants, 6, plu070. aptation to a changing environment, Proc. Biol. Sci., 279, 5048–57. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 13 41. Hanikenne, M., Talke, I. N., Haydon, M. J., et al. 2008, Evolution of co-segregate with zinc tolerance and account for high MTP1 transcript metal hyperaccumulation required cis-regulatory changes and triplication levels, Plant J., 39, 425–39. of HMA4, Nature, 453, 391–5. 59. Duarte, J. M., Wall, P. K., Edger, P. P., et al. 2010, Identification of shared 42. Al-Shehbaz, I. A. and O’Kane, S. L., Jr 2002, Taxonomy and phylogeny single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and of Arabidopsis (Brassicaceae). In The Arabidopsis Book/American Society their phylogenetic utility across various taxonomic levels, BMC Evol. of Plant Biologists, Vol. 1. Biol., 10, 61. 43. Boratyn, G. M., Camacho, C., Cooper, P. S., et al. 2013, BLAST: a more 60. Wilgenbusch, J. C. and Swofford, D. 2003, Inferring evolutionary trees efficient report with usability improvements, Nucleic Acids Res., 41, with PAUP*, Curr. Protoc. Bioinformatics, Chapter 6, Unit 6 4. W29–33. 61. Saitou, N. and Nei, M. 1987, The neighbor-joining method: a new 44. Johnston, J. S., Pepper, A. E., Hall, A. E., et al. 2005, Evolution of genome method for reconstructing phylogenetic trees, Mol. Biol. Evol., 4, 406–25. size in Brassicaceae, Ann. Bot., 95, 229–35. 62. Willems, G., Drager, D. B., Courbot, M., Gode, C., Verbruggen, N. and 45. Kajitani, R., Toshimoto, K., Noguchi, H., et al. 2014, Efficient de novo as- Saumitou-Laprade, P. 2007, The genetic basis of zinc tolerance in the met- sembly of highly heterozygous genomes from whole-genome shotgun allophyte Arabidopsis halleri ssp. halleri (Brassicaceae): an analysis of short reads, Genome Res., 24, 1384–95. quantitative trait loci, Genetics, 176, 659–74. 46. Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and 63. Frerot, H., Faucon, M. P., Willems, G., et al. 2010, Genetic architecture Birol, I. 2009, ABySS: a parallel assembler for short read sequence data, of zinc hyperaccumulation in Arabidopsis halleri: the essential role of Genome Res., 19, 1117–23. QTL x environment interactions, New Phytol., 187, 355–67. 47. Luo, R., Liu, B., Xie, Y., et al. 2012, SOAPdenovo2: an empirically im- 64. Vishnoi, A., Kryazhimskiy, S., Bazykin, G. A., Hannenhalli, S. and proved memory-efficient short-read de novo assembler, Gigasci., 1, 18. Plotkin, J. B. 2010, Young proteins experience more variable selection 48. Wu, T. D. and Watanabe, C. K. 2005, GMAP: a genomic mapping and pressures than old proteins, Genome Res., 20, 1574–81. alignment program for mRNA and EST sequences, Bioinformatics, 21, 65. Jordan, I. K., Wolf, Y. I. and Koonin, E. V. 2004, Duplicated genes evolve 1859–75. slower than singletons despite the initial rate increase, BMC Evol. Biol., 4, 22. 49. Langdon, W. B. 2015, Performance of genetic programming optimised 66. Alba, M. M. and Castresana, J. 2005, Inverse relationship between evolu- Bowtie2 on genome comparison and analytic testing (GCAT) bench- tionary rate and age of mammalian genes, Mol. Biol. Evol., 22, 598–606. marks, BioData Min., 8,1. 67. Wolf, Y. I., Novichkov, P. S., Karev, G. P., Koonin, E. V. and Lipman, D. 50. Untergasser, A., Nijveen, H., Rao, X., Bisseling, T., Geurts, R. and J. 2009, The universal distribution of evolutionary rates of genes and dis- Leunissen, J. A. 2007, Primer3Plus, an enhanced web interface to tinct characteristics of eukaryotic genes of different apparent ages, Proc. Primer3, Nucleic Acids Res., 35, W71–4. Natl. Acad. Sci. U. S. A., 106, 7273–80. 51. Remm, M., Storm, C. E. and Sonnhammer, E. L. 2001, Automatic cluster- 68. Davis, J. C. and Petrov, D. A. 2004, Preferential duplication of conserved ing of orthologs and in-paralogs from pairwise species comparisons, proteins in eukaryotic genomes, PLoS Biol., 2, E55. J. Mol. Biol., 314, 1041–52. 69. Yang, J., Gu, Z. and Li, W. H. 2003, Rate of protein evolution versus fit- 52. Katoh, K., Misawa, K., Kuma, K. and Miyata, T. 2002, MAFFT: a novel ness effect of gene deletion, Mol. Biol. Evol., 20, 772–4. method for rapid multiple sequence alignment based on fast Fourier trans- 70. Satake, M., Kawata, M., McLysaght, A. and Makino, T. 2012, Evolution form, Nucleic Acids Res., 30, 3059–66. of vertebrate tissues driven by differential modes of gene duplication, 53. Yang, Z. 2007, PAML 4: phylogenetic analysis by maximum likelihood, DNA Res., 19, 305–16. Mol. Biol. Evol., 24, 1586–91. 71. Lan, X. and Pritchard, J. K. 2016, Coregulation of tandem duplicate genes 54. Lynch, M. and Conery, J. S. 2000, The evolutionary fate and conse- slows evolution of subfunctionalization in mammals, Science, 352, 1009–13. quences of duplicate genes, Science, 290, 1151–5. 72. Rensing, S. A., Lang, D., Zimmer, A. D., et al. 2008, The Physcomitrella 55. Bu, L. and Katju, V. 2015, Early evolutionary history and genomic fea- genome reveals evolutionary insights into the conquest of land by plants, tures of gene duplicates in the human genome, BMC Genomics, 16, 621. Science, 319, 64–9. 56. Gao, L. Z. and Innan, H. 2004, Very low gene duplication rate in the 73. Heckman, T. M. and Kauffmann, G. 2011, The coevolution of galaxies yeast genome, Science, 306, 1367–70. and supermassive black holes: a local perspective, Science, 333, 182–5. 57. Heredia, N. J., Belgrader, P., Wang, S., et al. 2013, Droplet Digital PCR 74. Tuskan, G. A., Difazio, S., Jansson, S., et al. 2006, The genome of black cot- quantitation of HER2 expression in FFPE breast cancer samples, tonwood, Populus trichocarpa (Torr. & Gray), Science, 313, 1596–604. Methods, 59, S20–3. 75. Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. and Li, W. H. 1989, 58. Drager, D. B., Desbrosses-Fonrouge, A. G., Krach, C., et al. 2004, Two Date of the monocot-dicot divergence estimated from chloroplast DNA genes encoding Arabidopsis halleri MTP1 metal transport proteins sequence data, Proc. Natl. Acad. Sci. U. S. A., 86, 6201–5. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png DNA Research Oxford University Press

Functional divergence of duplicate genes several million years after gene duplication in Arabidopsis

Free
13 pages

Loading next page...
 
/lp/ou_press/functional-divergence-of-duplicate-genes-several-million-years-after-dVWtZSo04N
Publisher
Oxford University Press
Copyright
© The Author 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
ISSN
1340-2838
eISSN
1756-1663
D.O.I.
10.1093/dnares/dsy005
Publisher site
See Article on Publisher Site

Abstract

Lineage-specific duplicated genes likely contribute to the phenotypic divergence in closely re- lated species. However, neither the frequency of duplication events nor the degree of selection pressures immediately after gene duplication is clear in the speciation process. Here, using Illumina DNA-sequencing reads from Arabidopsis halleri, which has multiple closely related spe- cies with high-quality genome assemblies (A. thaliana and A. lyrata), we succeeded in generat- ing orthologous gene groups in Brassicaceae. The duplication frequency of retained genes in the Arabidopsis lineage was 10 times higher than the duplication frequency inferred by compara- tive genomics of Arabidopsis, poplar, rice and moss (Physcomitrella patens). The difference of duplication frequencies can be explained by a rapid decay of anciently duplicated genes. To ex- amine the degree of selection pressure on genes duplicated in either the A. halleri-lyrata or the A. halleri lineage, we examined positive and purifying selection in the A. halleri-lyrata and A. halleri lineages throughout the ratios of nonsynonymous to synonymous substitution rates (K / K ). Duplicate genes tended to have a higher proportion of positive selection compared with non-duplicated genes. Interestingly, we found that functional divergence of duplicated genes was accelerated several million years after gene duplication compared with immediately after gene duplication. Key words: plant evolution, gene duplication, Arabidopsis, selection pressure, functional divergence V C The Author 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 1 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 2 Functional divergence after gene duplication 1. Introduction collected the plants in Mt. Ibuki, Shiga, Japan. We then extracted the plant DNAs, and performed the paired-end Illumina DNA sequencing. Plant genomes have experienced more gene duplication events than After generating contigs from the Illumina DNA-sequencing reads, we most other eukaryotes. These duplications are largely classified into inferred A. halleri orthologous genes against the A. thaliana and whole genome duplications (WGDs) and small-scale duplications A. lyrata genes. The number of SSDs was inferred both from phyloge- (SSDs). WGD events are concentrated in the Cretaceous-Paleogene netic analysis of orthologous genes and the reading depth of Illumina extinction period. SSDs such as retroduplication and tandem dupli- DNA sequencing. As outgroup species for the three Arabidopsis species, cation have occurred continuously at a high rate during plant evolu- we used five non-Arabidopsis species in Brassicaceae for which genome tion, indicating that SSDs may be a key factor for plant speciation or 3–8 sequences are already determined. There is trans-specific polymorphism phenotypic differences in close relatives. SSDs tend to be lineage- in Arabidopsis; in particular, some genes have undergone gene flow in specific in land plants. There is a clear functional bias between 4,9 Arabidopsis. Here, excluding genes that have undergone gene flow in WGD and SSD in plants. SSDs tend to be associated with stress re- 4,8–10 Arabidopsis, we generated three sets of orthologous gene groups sponses, and are likely important for adaptive evolution to rap- (OGGs) among five non-Arabidopsis species, A. thaliana, A. lyrata and idly changing environments in close relatives. A. halleri in Brassicaceae (Fig. 1). Each OGG represent genes derived SSD genes functionally diverge through either neo-functionalization 11,12 from one ancestral gene in each speciation event. By identifying the or sub-functionalization, which may contribute to speciation. SSD genes in each OGG, we examined the evolutionary process in the genes can also retain functional redundancy through the long-term fate 13,14 15,16 Arabidopsis lineage. Although a BAC library-based genome for A. hal- of becoming pseudogenes, dosage selection or avoiding devel- 17–19 leri is unavailable, there is an available A. halleri genome based on opmental errors. Functional divergence of duplicated genes can be long-insert mate paired reads and short-insert paired-end reads in inferred by selection pressures based on the non-synonymous substitu- 38,39 Illumina DNA sequencing. The available A. halleri genome was tion rate (K ) versus the synonymous substitution rate (K ). Genes un- A S originated from A. halleri subsp. gemmifera which was the same sub- dergoing positive selection (K /K > 1) and purifying selection (K /K A S A S species in our plant material. The collection site was in Tada mine, < 1) are associated with functional divergence and constraint, respec- Hyogo, Japan. Microsatellite analyses suggested that our plant material tively. Most earlier studies compared selection pressures in WGDs collected from Mt. Ibuki, Shiga, Japan was genetically closed to the with those in SSDs, or selection pressures in anciently duplicated genes 21–24 plant with the available genome. Thus, it is likely that A. halleri genes (several hundred million years [MY] ago) in plants. There are little inferred from the available genome are similar to our inferred A. halleri data with which to examine selection pressures in recent SSDs in plants. genes. To validate our overall results, we also examined SSDs based on Because selection pressure likely varies during the speciation process, it the available A. halleri genome, is important to examine the selection pressures immediately after gene In a previous report, selection pressures of genes duplicated after duplications among close relatives. the divergence of A. thaliana and A. lyrata were examined in the A. Recently, there are many plant genomes assembled by Illumina thaliana lineage. These lineage-specific duplicated genes tend to DNA-sequencing reads. However, SSDs tend to be underestimated 26,27 have undergone positive selection. Here, we examined the selection in contigs generated by only Illumina DNA-sequencing. It is pressures in duplicated genes in the other lineage after the split be- likely that genomes generated by BAC are useful to examine SSDs. tween A. thaliana and A. lyrata. This lineage was further split into However, construction of such a library takes much budget and time the A. halleri–lyrata lineage and A. halleri lineage (Fig. 2). We then to generate highly qualified genomes. In the present study, we tried examined whether gene duplications in either the A. halleri–lyrata or to generate reliable orthologous genes by only paired-end Illumina A. halleri lineage contributed to functional divergence in the retained DNA-sequencing reads using available genomes of close relatives. duplicate genes. We then inferred SSDs in a species after the divergence of closed spe- Duplicated genes may have undergone purifying selection in the cies throughout the reading depth of Illumina DNA sequencing. Arabidopsis lineage. The gene dosage hypothesis proposes that du- The Arabidopsis lineage seems to be the best lineage to examine re- plicate genes can be fixed for increased gene dosage, keeping redun- cent SSDs in plants because Arabidopsis has two BAC library-based ge- 28,29 dant functions among duplicated genes. In A. halleri, the expression nomes in Arabidopsis thaliana and Arabidopsis lyrata. In of HEAVY METAL ATPASE 4 (HMA4) has been enhanced by cis- particular, A. thaliana has a large amount of functional data for its an- regulatory mutations and increased gene copy number for metal notated genes. To comprehensively examine SSDs in Arabidopsis,we in- hyperaccumulation. This is an example of increased gene dosage vestigated an additional Arabidopsis species—A. halleri. The divergence 30 by gene duplication. Consequently, duplicated genes with purifying time between A. lyrata and A. halleri is supposed to be 2MYA, and selection tend to be associated with increased gene dosage. the divergence time between either A. halleri or A. lyrata and A. thaliana 31 To understand the contribution of duplicated genes to the specia- is 10–11 MYA. Among recently diverged three Arabidopsis species, tion process in Arabidopsis within the last 10 MY, we first examined there are phenotypic variations such as self-compatibility and heavy- the frequency of gene duplications in the A. halleri–lyrata and A. hal- metal tolerance. A. thaliana has self-compatibility, but the others are 32–34 leri lineages. We then examined the relationship between gene dupli- self-incompatible. Therefore, A. halleri and A. lyrata have large pet- cation and selection pressure based on the K /K ratio. Finally, we A S als to attract pollinator insects, and the anthers are separated from the examined the over-represented functional categories of genes under stigma to avoid autopollination. A. halleri is known as a Zn/Cd positive and purifying selection. 33,34 hyper-accumulator andsome wildpopulations of A. lyrata are toler- ant of serpentine soils, which are characterized by a high heavy-metal content, while A. thaliana is not. 2. Materials and methods A. halleri has highly variable morphologies with respect to leaf, 2.1. Sampling and illumina paired-end flower colour and development of stolons among the subspecies. To ex- amine SSDs in the Arabidopsis lineage, we focused on A. halleri subsp. DNA-sequencing analysis gemmifera which grows in the Russian Far East, northeastern China, Arabidopsis halleri subsp. gemmifera is one of the three subspecies of Korea, Taiwan and Japan across lowland and highland areas. We A. halleri, and grows in the Russian Far East, northeastern China, Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 3 nonA-AT-AL-AH OGGs AT-AL-AH OGGs AL-AH OGGs B. rapa, B. stricta C. grandiflowa A. halleri (AH) A. thaliana (AT) A. lyrata (AL) C. rubella, E. salsugineum (non-A) Figure 1. Three sets of OGGs among non-Arabidopsis species, A. thaliana, A. lyrata and A. halleri. OGGs between A. lyrata and A. halleri were defined as AL–AH OGGs. There were 25,833, 26,428 and 26,007 AL–AH OGGs based on Illumina paired-end DNA-sequencing reads without any pseudogene-like genes, Illumina paired-end DNA-sequencing reads including pseudogene-like genes and the available A. halleri genome, respectively. OGGs among A. thaliana, A. lyrata and A. halleri were defined as AT–AL–AH OGGs. There were 22,105, 22,684 and 21,537 AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads with- out any pseudogene-like genes, Illumina paired-end DNA-sequencing reads including pseudogene-like genes and the available A. halleri genome, respectively. OGGs among non-Arabidopsis species (B. rapa, B. stricta, C. grandiflora, C. rubella, E. salsugineum), A. thaliana, A. lyrata and A. halleri were defined as nonA- AT–AL–AH OGGs. There were 17,669 and 17,925 non-A-AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads without any pseudogene-like genes and the available A. halleri genome, respectively. 450-700 mya -3 1.8-3.5 x 10 (gains/gene/my) 200 mya -3 2.0-2.5 x 10 (gains/gene/my) 100-120 mya -3 2.5-3.0 x 10 (gains/gene/my) 33-36 mya A. halleri-lyrata lineage 10-11mya -2 (gains/gene/my) 1.6-2.0 x 10 2 mya A. halleri lineage -2 5.8-7.6 x 10 (gains/gene/my) P. patens O.sativa P.trichocarpa B. rapa A. thaliana A.lyrata A.halleri Figure 2. Gain rates through gene duplication in land plants. The divergence times among moss (P. patens), rice (O. sativa), poplar (P. trichocarpa), B. rapa, 30,31,72–75 A. thaliana, A. lyrata and A. halleri were taken from previous reports. Gene gains through gene duplication were estimated for each branch. The gene gains in two branches were estimated in this study. One was the A. halleri-lyrata lineage, which represents the evolutionary process of the A. halleri lineage be- tween the divergence of A. thaliana and the divergence of A. lyrata. The other was the A. halleri lineage, which represents the evolutionary process of the A. hal- leri lineage after the divergence of A. lyrata. Gene gains were divided by ancestral gene numbers and branch lengths corresponding to times (MY). The rate of gene gain was defined as the number of genes gained through gene duplication per gene per MY. Korea, Taiwan and Japan. In 2008, a leaf sample was collected nuclear genome, sequencing reads mapped to either the mitochon- from an individual of A. halleri subsp. gemmifera on Mt. Ibuki, drial or chloroplast genome in A. thaliana (https://www.arabidopsis. Shiga, Japan. DNA was extracted from the leaf using a DNeasy org, TAIR10) or A. lyrata (http://genome.jgi.doe.gov, Filtered Plant Mini Kit (QIAGEN, Venlo, The Netherlands). A 300-bp Models6) by BLASTN version 2.2.29 (e-value < 1.0) were excluded paired-end DNA library was constructed according to the paired- from the following procedures. Assuming that the genome size of End Genomic DNA Sample Preparation Kit (Illumina, San Diego, A. halleri was 255 Mb, the sequencing coverage was estimated to CA, USA), and 93-bp paired-end reads were obtained from the be 135 (34.4/0.255 Gb). Illumina GAIIx sequencer. A total of 44.5 Gb reads were generated by Illumina DNA paired- 2.2. Gene sequences of Brassica rapa, Boechera stricta, end sequencing. Using the Trim Galore (www.bioinformatics.babra Capsella grandiflora, Capsella rubella, Eutrema ham.ac.uk) software, sequences with low-quality base calls (Phred score < 20) were trimmed off from the 3 end of the reads. Adapter salsugineum, A. thaliana and A. lyrata sequences started with ‘AGATCGGAAGAGC’ were completely re- Gene sequences in B. rapa (BrapaFPsc_277_v1.3), B. stricta moved from the reads. Approximately 28% of reads had either low- (Bstricta_278_v1.2), C. grandiflora (Cgrandiflora_266_v1.1), C. ru- quality scores or adapters and were trimmed by Trim Galore. When bella (Crubella_183_v1.0) and E. salsugineum (Esalsugineum a paired-end read was completely removed by this procedure, the _173_v1.0) were collected from Phytozome (https://phytozome.jgi. other read was used as a single-end read. Given that mitochondrial doe.gov, version 10). Gene sequences in A. thaliana and A. lyrata and chloroplast genomes have much higher copy numbers than the were collected from TAIR (https://www.arabidopsis.org/, version 10, Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 4 Functional divergence after gene duplication TAIR10_cds_20110103) and JGI (http://genome.jgi.doe.gov, mapped by BOWTIE2 version 2.2.3. Reads per kilobase of exon FilteredModels6), respectively. model per million (RPKM) values were calculated for each A. halleri gene. For 12 A. halleri genes whose copy numbers were known (Supplementary Table S2), we designed primer pairs using the fol- 2.3. Assembly of A. halleri genes orthologous to lowing parameters in the Primer3Plus software: primer length of A. lyrata genes based on illumina paired-end 18–24 bases, amplicon length of 70–150 bp, T value of 57–63 C DNA-sequencing reads and GC composition of 40–60%. To obtain enough genomic DNA We first generated A. halleri contigs from a total of 44.5 Gb Illumina for droplet digital PCR (ddPCR), we mixed the genomic DNAs from DNA-sequencing reads using several assembly software tools such as four individuals of A. halleri from Mt. Ibuki. The genomic DNA was 45 46 47 Platanus 1.2.4, ABySS 1.5.2, SOAP-denovo2 and CLC sonicated with the Covaris M220 system (25 s in frequency sweeping Genomics Workbench 7.0.3 (https://www.qiagenbioinformatics. mode). The concentration of the sonicated genomic DNA sample com/) software. A higher proportion of genes in closely related spe- was 6 ng/ll. The peak size of sonicated DNA fragments was 2,382 cies tend to be mapped to reliable assembled sequences. As the clos- bp according to the Fragment Analyzer system (Advanced est species, 32,670 annotated A. lyrata genes were mapped to each Analytical, Ankeny, IA, USA) with the High Sensitivity Genomic type of assembled DNA segment by the gmap (version 2014-08-20) DNA Analysis Kit (Advanced Analytical). Each ddPCR reaction was software with default parameters, which takes into account exon–in- performed with ddPCR EvaGreen Supermix (Bio-Rad, Hercules, CA, tron junctions. The A. halleri contigs generated by the ABySS soft- USA). Each reagent was divided into 20,000 droplets with a drop- ware with 63 K-mer size (N50 size ¼ 5,331 bp) had the highest let generator (Bio-Rad QX-200). Cycled droplets were measured mapping rate to the 32,670 annotated A. lyrata genes with >80% with a QX200 droplet reader (Bio-Rad). coverage (Supplementary Table S1). Some of the matched regions To find A. halleri genes whose DNA concentrations were robustly were truncated by terminal codons. When coding sequences up to inferred by ddPCR, we first identified uniquely mapped regions (>80 the terminal codon had >80% similarity to and 80% coverage for bp) in A. halleri genes from the Illumina DNA-sequencing reads. an A. lyrata gene, the coding sequence was defined as the ortholo- Among the A. halleri genes with uniquely mapped regions, we manu- gous A. halleri gene sequence to an A. lyrata gene. Following this ally chose 50 A. halleri genes with a variety of RPKMs. We designed procedure, we generated 22,727 orthologous gene pairs between A. a pair of primers for each gene in the Primer3Plus software using the halleri and A. lyrata. However, an A. lyrata gene sequence was fre- parameters described earlier. To examine the specificity of the tar- quently mapped to different A. halleri contigs depending on the re- geted DNAs, we performed real-time PCR analysis using the proto- gion of the A. lyrata gene because the whole sequence of each A. col for the Mx3000P qPCR System (Agilent Technologies). The PCR halleri gene was split into several contigs. To further identify A. hal- analyses were performed using SsoFast EvaGreen Supermix (Bio- leri genes orthologous to A. lyrata genes, we re-assembled the A. hal- Rad) and the products were analysed using the Mx3000P multiplex leri contigs based on the mapping information of A. lyrata genes. To quantitative PCR system (Agilent Technologies). Specific and multi- collect additional A. halleri genes orthologous to A. lyrata, ple reactions should result in a single and multiple melting peaks cor- unmapped A. lyrata genes were re-mapped to the assembled DNA responding to the PCR products. Of the 50 primer pairs, 25 showed 5 43 segments by TBLASTN version 2.2.29 (e-value < 1  10 ). a clear raised curve for all three replicates. Thus, for the copy number However, the contigs mapped separately by A. lyrata genes may not assay by ddPCR, we used 13 primer pairs for unknown copy number be orthologous syntenic regions. To focus on only syntenic contigs genes, 10 pairs for single-copy genes in a broad range of plant species between A. halleri and A. lyrata, we defined syntenic contigs to and 2 pairs for a three-copy gene (Supplementary Table S2). which more than two A. lyrata genes within less than five spacer genes in A. lyrata genome were mapped. When an A. lyrata gene was separately mapped to two syntenic contigs between A. lyrata and A. 2.5. Assembly of A. halleri genes orthologous to A. halleri, the contigs were concatenated following the direction of the lyrata genes based on the available draft A. halleri A. lyrata gene. The A. lyrata gene was then mapped to the concaten- genome ated contigs. We thus obtained an additional 3,106 A. halleri gene Coding genes of A. halleri were generated on the draft A. halleri ge- sequences with >80% similarity and 80% coverage against A. lyrata nome using long-insert mate paired reads and short-insert paired-end genes. Finally, we succeeded in identifying 25,833 (79.1%) A. halleri reads in Illumina DNA sequencing. On the genome, we used anno- genes orthologous to 32,670 A. lyrata genes. Given that all identified tated coding genes. We performed BLAST search between the anno- A. halleri genes were paired with A. lyrata genes, a total of 25,833 tated coding A. halleri and A. lyrata genes by BLASTP. The best pairs were defined as A. lyrata–A. halleri (AL–AH) OGGs. matching pairs were defined to be AL–AH OGGs with more than In the above procedures, we did not identify pseudogene-like A. >80% similarity to and 80% coverage at amino acid level. We suc- halleri genes which are orthologous to A. lyrata genes. Consequently, ceeded in identifying 26,007 (79.6%) A. halleri genes which were duplication frequency may be underestimated in either the A. lyrata– orthologous to 32,670 A. lyrata genes. Out of 26,007 orthologous A. halleri or the A. halleri lineage. Therefore, we also identified pairs of A. halleri and A. lyrata genes, 22,105 (22,105/26,007 ¼ 26,428 orthologous A. halleri regions truncated by terminal codons 85%) A. lyrata genes were inferred in the AL–AH OGGs based on with >80% similarity and 80% coverage against A. lyrate genes. only Illumina paired-end DNA-sequencing reads. The number of spe- These analysis procedures and findings are summarized in cifically identified A. halleri genes based on Illumina paired-end Supplementary Figure S1. DNA sequencing and the available A. helleri genome was 3,728 and 3,902, respectively (Supplementary Fig. S2). Thus, either method had 2.4. Validation of A. halleri lineage-specific duplicated a particular advantage to infer the orthologous genes. genes by droplet digital PCR A. halleri genes duplicated in the A. lyrata lineage were identified To calculate the read coverage of each A. halleri gene, the mapped with the Inparanoid algorithm Given the best-match pair, A1 and count was divided by the number of genes to which a read was B1, an A. halleri gene that had a smaller sequence distance to A1 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 5 (or B1) than the distance between A1 and B1, was added to the AL– (GO) assignments for the A. thaliana genes from The Arabidopsis AH OGG containing A1 and B1 (seeds). This process was continued Information Resource (www.arabidopsis.org). Among the top three until all qualified A. halleri genes were assigned to seed pairs of the GO categories (cellular components, molecular functions, and biologi- genes. cal processes), we analysed only biological processes. For each GO cat- egory in the biological processes category, the expected ratio was inferred to be the ratio between the number of genes assigned to the 2.6. Selection pressures in the A. halleri-lyrata and GO category and the number of genes not assigned to the GO category A. halleri lineages among all annotated A. thaliana genes. In each GO category, the ob- To infer selection pressure in the A. halleri–lyrata lineage, we fo- served ratio in each category of OGGs was compared with the ex- cussed on orthologous groups that followed the speciation process pected ratio by a v test for different categories of OGGs. The ratios among the genes of the five non-Arabidopsis species (B. rapa, were processed in the R software environment (www.r-project.org) B. stricta, C. grandiflora, C. rubella, E. salsugineum), A. thaliana, and normalized among different arrays using Z-scores calculated using A. lyrata and A. halleri. In each orthologous group, a multiple align- genescale in the R library. A heatmap was generated using heatmap.2 ment was made to match the coding regions with the computer pro- in the R library. To correct for multiple testing, the false discovery rate gramme MAFFT v7.215 with default parameters. Using the (FDR) was estimated by the R-library Q-VALUE software. The null phylogenetic tree and multiple alignment, we constructed the com- hypothesis was rejected if FDR values were < 0.01. mon ancestral sequences among A. thaliana, A. halleri and A. lyrata, and the common ancestral sequence between A. halleri and A. lyrata with codeml (model ¼ 1, NSsites ¼ 0) in PAML. For each pair of an- 3. Results and discussion cestral sequences, the synonymous (K ) and non-synonymous substi- tution rates (K ) were calculated using yn00 (code ¼ 0, weighting ¼ 3.1. Duplication frequency in the A. halleri-lyrata 0, commonf3x4 ¼ 0) in PAML. To determine whether the K /K A S lineage ratio was significantly <1, two maximum likelihood values were cal- To examine the frequency of duplication events in the A. halleri-lyrata culated with the K /K ratio fixed at 1 and with the K /K ratio as a A S A S lineage (Fig. 2), we first used 25,833 AL–AH OGGs based on Illumina free parameter. The ratio of the maximum likelihood values was paired-end DNA-sequencing reads. We searched for an orthologous then compared with the v distribution. A P-value representing the A. thaliana gene for each AL–AH OGG by BLASTP searches between deviation of the two models was then calculated for the A. halleri- AL and AH OGG protein sequences and annotated A. thaliana pro- lyrata lineage. tein sequences. When both the A. lyrata and A. halleri genes in an To infer selection pressure in the A. halleri lineage, we generated a AL–AH OGG had the best hit to the same A. thaliana gene, the A. tree file for the A. thaliana, A. lyrata and A. halleri genes. In each of thaliana gene was considered an orthologous candidate gene. Of the orthologous groups, a multiple alignment was made to match the 25,833 AL–AH OGGs, 93.3% (24,019/25,833) had orthologous can- coding regions by the computer programme MAFFT. Using the tree didate genes in A. thaliana. Among these, we searched for orthologous file and multiple alignment file, we constructed the common ancestral groups that were consistent with the species tree. To do this, the syn- sequences among A. thaliana, A. halleri,and A. lyrata with codeml onymous substitution rates (K ) were calculated among the A. thali- (model ¼ 1, NSsites ¼ 0) in PAML. Although we used a representa- ana, A. lyrata,and A. halleri genes in each orthologous group using tive A. halleri gene sequence to infer the ancestral sequence, proper A. yn00 in PAML. When the K between A. lyrata and A. halleri was halleri genes had sequence variation when they were lineage-specific lower than both the K between A. thaliana and A. lyrata and the K S S duplicated after the split from A. lyrata. From the Illumina DNA-se- between A. thaliana and A. halleri, we assumed that the topology of quencing reads mapped to A. halleri genes by BOWTIE2 version the orthologous group was consistent with the species tree. Among the 2.2.3 with default parameters, the sequence variation was examined 24,019 AL–AH OGGs, 22,602 (22,602/24,019 ¼ 94.1%) had the in each A. halleri gene by the Picard software with default parameters same topology as the species tree. In the 22,602 AL–AH OGGs, (http://broadinstitute.github.io/picard/). When a variable sequence 16,704 and 2,370 A. thaliana genes were uniquely and multiply as- was observed in the A. halleri genes, the codon sequence including the signed to AL–AH OGGs, respectively. To examine the duplication fre- variable nucleotide was concatenated into the representative A. halleri quency in A. halleri-lyrata lineage, we considered two evolutionary genes. Alignments between A. halleri genes including codons with scenarios for the 2,370 A. thaliana genes multiply assigned to 5,898 variable nucleotides and the ancestral sequence were modified by add- OGGs. One is that gene duplication events occurred in the A. halleri- ing codons of the ancestral sequence against concatenated codons in- lyrata lineage. That is, the assigned A. thaliana gene is orthologous to cluding variable nucleotides. K and K were calculated in each pair A S the AL–AH OGGs. The other is that gene loss events occurred in the of ancestral and A. halleri gene sequences including variable se- A. thaliana lineage after gene duplication in the common ancestor quences using yn00 in PAML. To determine whether the K /K ratio A S among A. thaliana, A. lyrata and A. halleri. In this case, the assigned was significantly <1, two maximum likelihood values were calculated A. thaliana gene is not orthologous to the AL–AH OGGs. To examine with the K /K ratio fixed at 1 and with the K /K ratio as a free pa- A S A S these two possible scenarios, we collected AL–AH OGGs to which an rameter using codeml in PAML. The ratio of the maximum likelihood A. thaliana gene was assigned as an orthologous gene. We then calcu- values was then compared with the v distribution. A P-value repre- lated the K between the A. thaliana gene and AL–AH OGG, and senting the deviation of the two models was then calculated for each searched for the closest pair of the AL–AH OGG against the A. thali- A. halleri gene. These analysis procedures are summarized in ana gene. When the K between the chosen AL-AH OGG and the Supplementary Figure S3. other AL-AH OGG was lower than the K in the closest pair, the other AL–AH OGG was defined as sharing the A. thaliana gene as an 2.7. Inference of over-represented gene ontologies orthologous gene. Of the 5,898 AL–AH OGGs, 497 had no ortholo- Assuming that A. halleri and A. lyrata genes have similar GO assign- gous A. thaliana genes and 5,401 had 2,370 orthologous A. thaliana ments in A. thaliana in the same OGG, we obtained gene ontology genes. In the following analyses, 22,105 (16,704 þ 5,401) AL–AH Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 6 Functional divergence after gene duplication OGG with 19,074 (16,704 þ 2,370) orthologous A. thaliana genes patens), rice (Oryza sativa) and poplar (Populus trichocarpa). The were defined as A. thaliana–A. lyrata–A. halleri (AT–AL–AH) OGGs gain rates were 1.8–3.5  10 in the three branches, 10 times (Supplementary Table S3). These analysis procedures are summarized lower than in the A. halleri-lyrata lineage (Fig. 2). One explanation in Supplementary Figure S4. for this gain rate is that even though many genes were fixed and re- In A. lyrata-halleri lineage, 22,105 AT–AL–AH OGGs were de- tained, a large number of them did not survive in the long run. This rived from the 19,074 genes, which represent the number of genes in explanation is consistent with the gradual decay of paralog synony- the common ancestor of A. thaliana, A. lyrata and A. halleri. Thus, mous substitution rates observed in several eukaryotes over 54,55 3,031 (22,10519,074) gains through gene duplication were sup- time. posed to have occurred in the A. halleri-lyrata lineage. The gain rate Note that the effect of tandemly duplicated genes on these gain (total gene duplication gains during a given time period divided by rates needs to be considered because tandemly duplicated genes have the estimated duration per gene) was inferred to be 1.8–2.0  10 undergone gene conversion, which leads to identical sequences (3,031 gains/19,074 genes/8–9 MY) in the A. halleri-lyrata lineage. among tandemly duplicated genes in the same species. Additionally, we examined 26,428 AL–AH OGGs including Consequently, our procedure may have failed to identify some pseudogene-like A. halleri genes (Supplementary Table S4), and iden- orthologous A. halleri genes among the A. lyrata tandemly dupli- tified 22,684 AT–AL–AH OGGs derived from the 19,318 ortholo- cated genes. In such cases, our method would have missed the dupli- gous A. thaliana genes. The gain rate was inferred to be 1.9–2.2  cation event in the A. halleri-lyrata lineage, meaning our gain rate is 10 (3,366 gains/19,318 genes/8–9 MY) in the A. halleri-lyrata line- underestimated in the A. lyrata-halleri lineage. Thus, tandemly dupli- age. The gain rate in OGGs including pseudogene-like genes cated genes did not disturb the trend of a higher gain rate in A. hal- (Fig. 4C) is similar to OGGs without pseudogene-like genes, indicat- leri-lyrata in comparison with other land plants. ing that addition of pseudogene-like A. halleri genes did not affect duplication frequency in the A. halleri-lyrata lineage. Second, we used 26,007 AL–AH OGGs based on the available A. 3.2. Duplication frequency in the A. halleri lineage halleri genome. Of 26,007 AL–AH OGGs, 91.0% (23,682/26,007) The copy numbers of the A. halleri genes were unclear from AT– had orthologous candidate genes in A. thaliana (Supplementary AL–AH OGGs. However, the absolute copy number of an A. halleri Table S5). Among the 23,682 AL–AH OGGs, 21,984 (21,984/ gene could be experimentally inferred by the absolute DNA concen- 23,682 ¼ 92.8%) had the same topology as the species tree. In the tration of the gene by ddPCR. First, to examine the relationship be- 21,984 AL–AH OGGs, 16,800 and 2,058 A. thaliana genes were tween copy number and DNA concentration, we focussed on known 41 58 uniquely and multiply assigned to AL–AH OGGs, respectively. Since three-copy genes, HMA4 and MTP1, and 10 singleton genes that 2,058 A. thaliana genes multiply assigned to 5,183 OGGs, 5,183 share a single copy in a broad range of plants. The concentration OGGs were classified into 446 OGGs which had no orthologous of HMA4 was 627.5 6 40.39 and 626.75 6 39.38 copies/ll (four A. thaliana genes and 4,737 OGGs which had 2,058 orthologous replicates for each of the two primer pairs, average of each primer A. thaliana genes. Following the above procedure, 21,537 (16,800 þ pair 6 95% CI). The concentration of MTP1 was 605.0 6 22.39. 4,737) AL–AH OGG with 18,858 (16,800 þ 2,058) orthologous Conversely, the concentration of the 10 singleton genes was 193.30 A. thaliana genes were defined as AT–AL–AH OGGs. The gain rate 6 9.06 copies/ll (four replicates for each of the 10 primer pairs, av- was inferred to be 1.6–1.8  10 (2,679 gains/18,858 genes/8–9 erage of all 10 primer pairs 6 95% CI). The average DNA concen- MY) in the A. halleri-lyrata lineage. Taken together, the gain rate tration was 3.2 times higher for the three-copy genes compared without pseudogene-like A. halleri genes was 1.6–2.0  10 in the with the single-copy genes (Fig. 3A), indicating that the copy number A. halleri-lyrata lineage (Fig. 2). corresponded to the DNA concentration inferred by ddPCR. In our previous study, we inferred the gain rate of gene duplica- The copy numbers of the A. halleri genes could also be inferred by tion in the lineage leading to Arabidopsis after the divergence of the reading depth of the Illumina DNA-sequencing reads. The read- moss. Using the same method, we re-estimated the gain rates in ing depth was defined as the RPKM. To examine whether RPKM three times periods—after the divergence of moss (Physcomitrella values corresponded with copy numbers, we focussed on 25 AB y = 0.7991x 500 1200 0 500 1000 1500 2000 2500 3-copied genes Single copied genes RPKM Figure 3. Relationship between the Illumina DNA-sequencing read depth and the copy number inferred by ddPCR. (A) The Y-axis represents copy numbers per ll inferred by ddPCR. Black circles (gray background) and open circles (black background) indicate three-copy genes and single-copy genes, respectively. All points and error bars represent averages of four replicates and 95% CIs. (B) Each dot represents an A. halleri gene. The X-axis represents the Illumina DNA-sequencing read depth, which is the number of reads per 1 Kbp per 1 million reads. The Y-axis represents copy numbers per ll, which were inferred by ddPCR. The regression line was calculated with the simple formula Y ¼ aX; a was inferred by the least squares method. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Concentrations (Copies/µl) Concentrations (Copies/µl) K. Hanada et al. 7 A. halleri genes that had a wide variety of RPKMs (see section 2). genes, we fosused on 1,159 AT–AL–AH OGGs composing only For these genes, the copy number estimated by digital droplet PCR pseudogene-like A. halleri genes among 22,684 AT–AL–AH OGGs (ddPCR) was compared with the RPKM. Consequently, we found including pseudogene-like genes. We then identified 713 pseudogene- that RPKM was significantly correlated with concentration (Fig. 3B, like A. halleri genes experiencing gene duplications in either the R ¼ 0.94). This result indicated that RPKM values could be used an A. halleri-lyrata or the A. halleri lineage. In the case of 22,105 index of copy numbers for A. halleri genes. AT–AL–AH OGGs without any pseudogene-like A. halleri genes, we To infer duplication frequencies by RPKM values, we focussed on identified 6,752 A. halleri genes experiencing gene duplications in ei- 285 AT–AL–AH OGGs with single-copy genes in a broad range of ther the A. halleri-lyrata or the A. halleri lineage. The proportion plant species such as Arabidopsis, Carica, Populus, Vitis, Oryza, (713/1,159 ¼ 62%) of duplicated genes in pseudogene-like A. halleri Selaginella and Physcomitrella. The RPKM values of the A. halleri genes was significantly higher than that (6,752/22,105 ¼ 31%) of genes showed a normal distribution (Supplementary Fig. S5A). To duplicated genes in the other A. halleri genes (P-value ¼ 1.9 107 2 call duplicated genes, the top 5% (309) of RPKMs was defined as the 10 by v test, Table 1), indicating that most of pseudogenes threshold for duplicated genes. That is, A. halleri genes with an tended to appear via gene duplication in Arabidopsis. Taken to- RPKM < 309 were defined as non-duplicated genes, A. halleri genes gether, it is likely that most of recently duplicated genes in with RPKM values from 309 to 618 (309  2) were defined as two- Arabidopsis may be on the process of decayed genes but some of du- copy genes, and A. halleri genes with RPKM values of 618–927 (309 plicated genes significantly contributed to functional divergence 3) were defined as three-copy genes. Following this rule, in 22,105 among Arabidopsis species. AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing Using 21,537 AT–AL–AH OGGs based on the available A. halleri reads, we identified 2,488 multiply duplicated A. halleri genes and genome, we identified 1,248 gain events in 838 genes 3,378 gain events through gene duplications in the A. halleri lineage (Supplementary Table S5). The gain rate was 2.9  10 (1,248 (Supplementary Table S3). The gain rate (total gains from gene du- gains/21,537/2MY). The gain rate was approximately half in com- plication during a given time period divided by the estimated dura- parison with the gain rate (5.8–7.6  10 ) inferred by RPKMs. tion per gene) was 7.6  10 (3,378 gains/22,105 genes/2 MY). Therefore, copy number information inferred by ddPCR was com- However, the gain events may have been over-estimated because pared with the number of A. halleri lineage-specific duplicated genes. duplication events occurring in the A. halleri-lyrata lineage caused an Most of OGGs had single-copy genes in A. halleri although various increase of RPKMs. Therefore, excluding A. halleri genes that were duplication frequencies were inferred by ddPCR (Supplemental duplicated in the A. halleri-lyrata lineage, we counted the number of Fig. S5B). These results indicate that some of gene duplication tended gain events in the A. halleri lineage. Among the currently retained A. to be missed in a genome assembly based only on Illumina short halleri genes, we focussed on 16,946 A. halleri genes that had not un- reads. This shows that the reading depth of raw Illumina reads may dergone gene duplication in the A.halleri-lyrata lineage. These be informative for inferring the number of recently duplicated genes. 16,946 genes had undergone 1,958 gain events through gene duplication in the A. halleri lineage. The re-estimated gain rate was 3.3. Selection pressures in the A. halleri-lyrata lineage 5.8  10 (1,958 gains/16,946 genes/2 MY). The gain rate (5.8–7.6 10 ) of duplicate genes in the A. halleri lineage was approxi- We found that 23–30% of OGGs had undergone gene duplications mately four times higher than the gain rate (1.6–2.0  10 ) in the in the Arabidopsis lineage at high gain rates (1.6–7.6  10 gains/ A. halleri-lyrata lineage (Fig. 2). As mentioned earlier, the gain rate gene/MY), which were 10 times higher than the rates inferred by difference can be explained by a rapid decay of duplicated genes. comparative genomics among Arabidopsis, poplar, rice and moss However, this gain rate did not include gene duplications derived (Fig. 2). Therefore, we were interested in investigating the functional from pseudogenes which were on the earlier process of decayed divergence of duplicated genes in Arabidopsis. To infer the func- genes. To determine the decayed effect in the A. halleri lineage, we tional divergence of duplicated genes in the A. halleri-lyrata lineage, focussed on 22,684 AT–AL–AH OGGs including pseudogene-like we tried to infer the ancestral sequences of the most recent common genes (Supplementary Table S4). Out of 22,684 OGGs, the gain rate ancestor of A. thaliana, A. lyrata and A. halleri. To define the node in the A. halleri lineage was 1.2  10 (5,665 gains/22,684 genes/2 of the most recent common ancestor among A. thaliana, A. lyrata, MY). Thus, the gain rate including pseudogenes-like A. halleri genes and A. halleri, we used an orthologous non-Arabidopsis gene as an was approximately two times higher than the gain rate (7.6  10 ) outgroup sequence for each of AT–AL–AH OGGs and performed without any pseudogene-like A. halleri genes. This result indicates BLASTP searches of AT–AL–AH protein sequences against all five that pseudogene-like A. halleri genes accelerated the gain rates in the non-Arabidopsis (B. rapa, B. stricta, C. grandiflora, C. rubella, A. halleri lineage. To determine whether pseudogenes tended to have E. salsugineum) protein sequences. When the best-hit non- a higher duplication rate in comparison with non-pseudogene-like Arabidopsis gene was the same for the A. thaliana, A. lyrata and Table 1. Comparison of gene duplication to non-gene duplication ratio in either A. halleri-lyrata or A. halleri lineage between OGGs including pseudogene-like A. halleri genes and OGGs without any pseudogene-like A. halleri genes Gene duplication No gene duplication D/N ratio P value (v test) in either A. halleri- in either A. halleri- lyrata or A. halleri lyrata or A. halleri lineage (D) lineage (N) OGGs derived from only pseudogenes 713 446 1.60 1.9  10 OGGs without any pseudogenes 6,752 15,353 0.44 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 8 Functional divergence after gene duplication A. halleri genes, the non-Arabidopsis gene was considered a candi- lineage, the proportions of positive selection and purifying selection date orthologous gene to the AT–AL–AH OGG. For each of candi- in 2,488 duplicated genes were compared with those in 19,617 date genes, we searched for candidate orthologous groups that were non-duplicated genes. In this test, the null model was the ratio of du- consistent with the species tree. To do this, we generated a phyloge- plicated genes to non-duplicated genes in the A. halleri lineage. As netic tree by the neighbour-joining method using the PAUP software observed in the A. halleri-lyrata lineage, the proportions of positive 60,61 (set outroot ¼ mono, dset distance ¼ hky). When the topology selection in the duplicated genes were significantly higher than those of the gene tree was the same as that of the species tree, we assumed in the non-duplicated genes (P-value ¼ 2.3  10 for positive selec- that the topology of the orthologous group was consistent with the tion by v test, Fig. 4C), while the proportion of purifying selection species tree. There were 22,105 and 21,537 AT–AL–AH OGGs in the non-duplicated genes was significantly higher than in the du- 26 2 based on Illumina paired-end DNA-sequencing reads and the avail- plicated genes (P-value ¼ 1.9  10 by v test, Fig. 4C). We did able A. halleri genomes, respectively. Out of 22,105 and 21,537 the same analysis in 21,537 AT–AL–AH OOGs based on the avail- AL–AH OGGs, we identified 17,669 and 17,925 orthologous groups able A. halleri genome (Supplementary Table S5). We found the that followed the speciation process among non-Arabidopsis, same trend in the OOGs (P-value ¼ 9.9  10 for positive selec- 18 2 A. thaliana, A. lyrata, and A. halleri genes, respectively tion, P-value ¼ 2.2  10 for purifying selection by v test, (Supplementary Table S3). These OGGs were defined as non-A-AT– Fig. 4D). These results indicate that gene duplication induced func- AL–AH OGGs. These analysis procedures are summarized in tional divergence in the A. halleri lineage as well. Supplementary Figure S6. Functional divergence may not occur immediately after gene du- Based on the phylogenetic tree in each non-A-AT–AL–AH OGG, plication. To determine whether gene duplication in the A. halleri- we inferred the ancestor sequences of all nodes using codeml in the lyrata lineage affected selection pressures in the A. halleri lineage, we PAML package, and calculated K and K in the A. halleri-lyrata classified the 22,105 AT–AL–AH OGGs based on Illumina paired- A S lineage. First, among the 17,669 OGGs base on Illumina paired-end end DNA-sequencing reads into 5,159 OGGs with gene duplications DNA-sequencing reads, we found 481, 8,313 and 8,875 genes with and 16,946 OGGs without any gene duplications, focusing on the positive selection (K /K > 1), purifying selection (K /K < 1) and A. halleri-lyrata lineage. The proportions of positive selection and A S A S unclear selection respectively, by the maximum likelihood approach purifying selection in the duplicated genes in the A. halleri lineage (Supplementary Table S3). To address whether the duplicated genes were compared with those in the non-duplicated genes. In this test, contributed to functional divergence in the A. halleri-lyrata lineage, the null model was the ratio of duplicated genes to non-duplicated the proportions of positive selection and purifying selection in 1,782 genes in the A. halleri-lyrata lineage. The proportion of positive se- duplicated genes were compared with those in 15,887 non- lection in the duplicated genes was significantly higher than those in duplicated genes. In this test, the null model was the ratio of dupli- the non-duplicated genes (P-value ¼ 1.3  10 for positive selec- cated genes to non-duplicated genes without any particular selection tion by v test, Fig. 4E). The proportion of purifying selection in the pressure in the A. halleri-lyrata lineage. The proportions of positive non-duplicated genes was significantly higher than in the duplicated 85 2 selection in the duplicated genes was significantly higher than those genes (P-value ¼ 1.9  10 by v test, Fig. 4E). We did the same 57 2 in the non-duplicated genes (P-value ¼ 2.8  10 by v test, analysis in 21,537 based on the available A. halleri genome. We Fig. 4A). The proportion of purifying selection in the non-duplicated again found the same trend (P-value ¼ 4.8  10 for positive selec- 39 2 genes was significantly higher than in the duplicated genes in the tion, P-value ¼ 1.1  10 for purifying selection by v test, 33 2 A. halleri-lyrata lineage (P-value ¼ 1.2  10 by v test, Fig. 4A). Fig. 4F). These results indicate that gene duplication in the A. halleri- Furthermore, we did the same analysis in 17,925 OOGs based on lyrata lineage contributed to functional divergence in the A. halleri the available A. halleri genome (Supplementary Table S5). We found lineage. the same trend for only positive selection in the OOGs (P-value ¼ To determine whether gene duplications in the A. halleri-lyrata or 3.0  10 for positive selection, P-value ¼ 0.05 for purifying selec- A. halleri lineage contributed to functional divergence in the A. hal- tion by v test, Fig. 4B). These results indicate that gene duplication leri lineage, we focussed on two categories of OGGs—genes not du- induced functional divergence in the A. halleri-lyrata lineage. plicated in the A. halleri-lyrata lineage but duplicated in the A. halleri lineage (1,593 OGGs), and genes duplicated in the A. halleri- lyrata lineage but not duplicated in the A. halleri lineage (4,264 3.4. Selection pressure in the A. halleri lineage OGGs). The proportions of positive selection and purifying selection To infer selection pressure in the A. halleri lineage, we generated an- in the A. halleri lineage were compared in the two categories. In this cestral sequences of the A. lyrata and A. halleri genes using A. thali- test, the null model was the ratio of the two categories of OGGs ana genes as outgroups. When an A. halleri gene did not have any (1,593 and 4,264 OGGs). The proportions of positive selection in sequence variation in the Illumina DNA-sequencing reads, K and genes duplicated only in the A. halleri-lyrata lineage were signifi- K were simply calculated by comparing the ancestral sequence to cantly higher than in genes duplicated only in the A. halleri lineage 9 2 the representative A. halleri gene sequence. When sequence variation (P-value ¼ 7.3  10 for positive selection by v test, Fig. 4G). for a representative A. halleri gene sequence was identified from the Conversely, the proportion of purifying selection in genes duplicated Illumina DNA-sequencing reads, K and K were separately calcu- only in the A. halleri-lyrata lineage was significantly lower than in S A lated for the varied sequences (see Materials and Methods, genes duplicated only in the A. halleri lineage (P-value ¼ 2.9  10 Supplementary Fig. S4). Among the 22,105 AT–AL–AH OGGs by v test, Fig. 4G). We did the same analysis in 21,537 based on based on Illumina paired-end DNA-sequencing reads, we found based on the available A. halleri genome. We found the same trend 568, 7,717, and 13,820 genes with positive selection (K /K > 1), (P-value ¼ 0.04 for positive selection, P-value ¼ 0.02 for purifying A S purifying selection (K /K < 1) and unclear selection, respectively, in selection in the available genome by v test, Fig. 4H). These results A S the A. halleri lineage by the maximum likelihood approach indicate that gene duplication in the A. halleri-lyrata lineage was the (Supplementary Table S3). To examine the relationship between the main determinant of the elevated proportion of positive selection in effect of gene duplication and selection pressures in the A. halleri the A. halleri lineage. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 9 AB Gene duplication in A. halleri-lyrata lineage 80% Gene duplication in A. halleri-lyrata lineage 100% Non-gene duplication in A. halleri-lyrata lineage Non-gene duplication in A. halleri-lyrata lineage 80% 60% 60% 40% 40% 20% 20% 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri-lyrata lineage A. halleri-lyrata lineage CD Gene duplication in A. halleri lineage Gene duplication in A. halleri lineage 80% 80% Non-gene duplication in A. halleri lineage Non-gene duplication in A. halleri lineage 60% 60% 40% * 40% 20% 20% 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri lineage A. halleri lineage EF 80% Gene duplication in A. halleri-lyrata lineage Gene duplication in A. halleri-lyrata lineage 80% Non-gene duplication in A. halleri-lyrata lineage Non-gene duplication in A. halleri-lyrata lineage 60% 60% 40% 40% * 20% 20% * * 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri lineage A. halleri lineage Gene duplication in A. halleri-lyrata lineage Gene duplication in A. halleri-lyrata lineage GH but non-gene duplication in A. halleri lineage but non-gene duplication in A. halleri lineage Non-gene duplication in A. halleri-lyrate lineage Non-gene duplication in A. halleri-lyrate lineage 80% 80% but gene duplication in A. halleri lineage but gene duplication in A. halleri lineage 60% 60% 40% 40% 20% 20% * * 0% 0% Positive Unclear Purifying Positive Unclear Purifying selection selection selection selection selection selection A. halleri lineage A. halleri lineage Figure 4. Gene duplication and selection pressure in the A. halleri-lyrata and A. halleri lineages. Genes were classified as being under positive selection (K /K > 1), unclear selection or purifying selection (K /K < 1) in the A. halleri-lyrata and A. halleri lineages. Asterisks (*) represent significant differences by A S A S the chi-square test (P < 0.05). (A) The relationship between gene duplication and selection pressure in the A. lyrata-halleri lineage in 17,669 OGGs base on Illumina paired-end DNA-sequencing reads. (B) The relationship between gene duplication and selection pressure in the A. lyrata-halleri lineage in 17,925 OOGs based on the available A. halleri genome. (C) The relationship between gene duplication and selection pressure in the A. halleri lineage in 22,105 AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads. (D) The relationship between gene duplication and selection pressure in the A. halleri lineage in 21,537 AT–AL–AH OOGs based on the available A. halleri genome. (E) The relationship between gene duplication in the A. lyrata-halleri lineage and selection pressure in the A. halleri lineage in 22,105 AT–AL–AH OGGs based on Illumina paired-end DNA-sequencing reads. (F) The relationship between gene duplication in the A. lyrata-halleri lineage and selection pressure in the A. halleri lineage in 21,537 AT–AL–AH OOGs based on the available A. halleri genome. (G) Comparison of selection pressure in the A. halleri lineage between gene duplication in only the A. lyrata-halleri lineage and gene duplication in only the A. halleri lineage in OGGs base on Illumina paired-end DNA-sequencing reads. (H) Comparison of selection pressure in the A. halleri lineage between gene du- plication in only the A. lyrata-halleri lineage and gene duplication in only the A. halleri lineage in OGGs based on the available A. halleri genome. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 10 Functional divergence after gene duplication GO categories assigned to the A. thaliana genes belonging to the OGGs, 3.5. Functional bias of genes under positive or over-represented GO categories were identified (Supplementary Table purifying selection in the A. halleri-lyrata and A. halleri S6;see Section 2,FDR < 0.01). lineages When we focussed on metal and zinc tolerance or accumulation We found that gene duplication contributed to functional divergence among A. thaliana, A. lyrata and A. halleri, both A. lyrata and A. in comparison with non-duplicated genes in Arabidopsis but many halleri had a higher tolerance to metal and zinc than A. thaliana. In duplicated genes had been retained with purifying selection, which in- particular, both metal and zinc tend to be accumulated in A. halleri duces an increase of gene dosage. Thus, we were interested in investi- 62,63 compared with A. lyrata. This observation suggests that metal gating the functional bias in duplicated and non-duplicated genes tolerance is enhanced in the A. halleri–A. lyrata lineage, and that with positive/purifying selection in the A. halleri-lyrata and A. halleri metal accumulation is enhanced in the A. halleri lineage. Genes asso- lineages. OGGs based on Illumina paired-end DNA-sequencing reads ciated with metal and zinc transporters/responses were highly dupli- were classified into duplicated and non-duplicated genes in the A. hal- cated with purifying selection in the A. halleri-lyrata lineage, leri-lyrata and A. halleri lineages. The OGGs were then further classi- indicating that the dosages of genes associated with metal responses fied by positive and purifying selection. This gave a total of eight and transporters had been enhanced in the A. halleri-lyrata lineage. classes. In each class, we examined the degree of over-representation Furthermore, genes associated with metal and zinc responses were in 1,504 GO categories compared with the number of GO categories highly duplicated with purifying selection in the A. halleri-lyrata line- assigned to A. thaliana genes (see Section 2) (Fig. 5). Although two age, indicating that the dosages of genes associated with metal and classes with no duplication and purifying selection were clustered into zinc responses had been enhanced in the A. halleri-lyrata lineage. one group, the other six classes were not essentially clustered with Thus, tolerance or accumulation of metal and zinc was enhanced in each other. These results indicate that the evolutionary direction of the A. halleri-lyrata and A. halleri lineages through gene duplication. A. halleri was quite different in the A. halleri-lyrata and A. halleri lin- Genes associated with the reproductive system, cell cycle, various de- eages with respect to either gene dosage by gene duplication or func- velopments, various metabolites, epigenetics, metabolites, abiotic and tional divergence by positive selection. biotic responses were subject to positive selection in either the A. hal- To examine the kinds of genes associated with phenotypic differences leri-lyrata or A. halleri lineage (Supplementary Table S6). However, we among A. thaliana, A. lyrata and A. halleri, we identified significantly do not have any clear idea why such genes were subject to positive over-represented GO categories in OGGs with gene duplication and pu- selection. Additionally, genes associated with various ubiquitous pro- rifying selection, OGGs with gene duplication and positive selection and cesses tended to be duplicated in either the A. halleri-lyrata or the A. OGGs with non-duplication and positive selection in the A. halleri–A. halleri lineage with purifying selection (Supplementary Table S6). These lyrata and A. halleri lineages. OGGs with non-duplication and purifying over-represented functional categories may contribute to phenotypic selection were disregarded in the analysis because such genes tended to differences between A. thaliana and either A. lyrata or A. halleri have the same functions in A. halleri, A. lyrata and A. halleri.From the Color Key −2 −1 0 1 2 Figure 5. Over-represented functional categories. The X-axis represents different kinds of OGGs. OGGs were classified into duplicated genes and non-dupli- cated genes in the A. halleri-lyrata (AHL) and A. halleri (AH) lineages. The OGGs were then further classified by positive or purifying selection. The Y-axis repre- sents 1,504 GO categories (biological processes) assigned to A. thaliana genes belonging to the OGGs. The key shows the relationship between colour and the z-score of over-representation in the GO categories. Red and green indicate low and high over-representation, respectively. The ratio of observed gene numbers in selected OGGs to expected gene numbers inferred from the data for all annotated genes was calculated in each GO category. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Duplication with positive selection in AH lineage Non-Duplication with purifying selection in AHL lineage Non-Duplication with purifying selection in AH lineage Non-Duplication with positive selection in AHL lineage Non-Duplication with positive selection in AH lineage Duplication with positive selection in AHL lineage Duplication with purifying selection in AH lineage Duplication with purifying selection in AHL lineage K. Hanada et al. 11 through high dosages. However, we do not know the phenotypic differ- result indicates that gene duplication tends to enhance functional di- ences associated with these functional categories. These duplicated vergence in comparison with non-duplicated genes in the genes might have been retained because increased gene dosages associ- Arabidopsis lineage. In contrast, the general observation is that du- ated with these functional categories were not too disadvantageous for plicated genes tend to have less functional divergence in yeasts, 24,68,69 A. halleri. To avoid gaining novel functions, these genes may be under plants and mammals. This is because functionally important purifying selection. In the future, these duplicated genes may be lost if genes are more likely to be retained as duplicates. This contradic- disadvantageous functions appear. Indeed, the A. halleri-lyrata and A. tory relationship might derive from the duplication ages. Most previ- halleri lineages have extraordinarily high retention rates of duplicated ous analyses have examined recently observed selection pressures in genes in comparison with earlier plant lineages. These observations in- anciently duplicated genes. When functional divergence was exam- dicate that most of the duplicated genes in the A. halleri-lyrata and ined in recently duplicated genes, the duplicated genes tended to have A. halleri lineages may be lost in future evolution. higher functional divergence than singletons. Together, these re- sults suggest duplicated genes tend to have higher functional diver- gence immediately after duplication than singletons. 3.6. Concluding remarks How long gene duplication accelerates functional divergence re- In this analysis, we generated 25,833 A. halleri genes that were mains an open question. To address this, we examined whether gene orthologous to 79.1% of the annotated A. lyrata genes from contigs duplication in the A. halleri-lyrata lineage (2–10 MYA) accelerated generated from only paired-end reads of Illumina DNA-sequencing. functional divergence in the A. halleri lineage (<2 MYA). On the other hands, we inferred 26,007 AH-AL OOGs based on the Interestingly, we found that gene duplication in the A. halleri-lyrata available A. halleri genome. Out of 32,670 A. lyrata genes, 79.6% lineage enhanced the proportion of positive selection in the A. halleri were identified as orthologous genes to A. halleri genes. Thus, our lineage (Fig. 4E–H). This result indicated that the functional diver- method inferring orthologous genes is compatible to a method to in- gence of duplicated genes was accelerated several MY after gene du- fer orthologous genes based on the available genome. However, it plication. If gene duplication is too deleterious for a gene, the gene has significant limitations for examining gene loss, exon shuffling tends to be lost immediately after duplication. If not, duplicated and heterozygosity because it does not infer any new genes in the ge- genes may be retained for a long period without functional diver- nomes. Nevertheless, the number of duplicated genes inferred by gence because functional divergence may be evolutionarily disadvan- reading depth of Illumina DNA-sequencing reads is likely to be more tageous. Therefore, immediately after duplication, most duplicated reliable in comparison with duplicated genes identified on scaffolds genes might be under functional constraints in comparison with generated by Illumina DNA-sequencing reads (Supplementary genes duplicated several MY earlier. Indeed, many recently dupli- 17,18 Fig. S5B). Therefore, this procedure is useful for inferring lineage- cated genes have functional redundancy in A. thaliana and in specific duplicated genes resulting from such events as SSDs. mammals. These young duplicated genes tend to be less function- The gain rates based on 25, 833 AL–AH OGGs based on both ally constrained than singletons, and may have the potential to ob- Illumina paired-end DNA-sequencing reads and the available A. hal- tain an essential function to survive in new environments. leri genome were 1.6–2.0  10 per gene per MY in the A. halleri- Finally, we examined the kinds of genes that were duplicated lyrata lineage (Fig. 2). Furthermore, using the only mapping coverage and/or under positive selection in the A. halleri-lyrata and A. halleri of the Illumina DNA-sequencing reads, the gain rate was inferred to lineages. Different functional categories tended to have experienced be 5.7–7.6  10 in the A. halleri lineage because gene duplication gene duplication and/or selection pressure in the A. halleri-lyrata and tended to be missed in a genome assembly based on Illumina short A. halleri lineages. For example, A. halleri is known as a heavy metal reads. That is, the gain rates were inferred to be 1.6–2.0 and 5.7–7.6 hyper-accumulator with high metal tolerance. A. lyrata is tolerant of 10 per gene per MY in the A. halleri-lyrata and A. halleri line- heavy-metal ions in the soil to some degree but A. thaliana is not. ages, respectively. The gain rate in the A. halleri lineage was approxi- Genes related to heavy-metal tolerance and accumulation tended to mately four times higher than in the A. halleri-lyrata lineage. Using be highly duplicated with purifying selection in the A. halleri-lyrata our previous data, we re-estimated the gain rates in the three time pe- and A. halleri lineages. Earlier studies reported that metal tolerance riods after the divergence of mosses, rice and poplar (Fig. 2). The in- was enhanced by increasing gene dosage through gene duplication. ferred gain rates (1.8–3.0  10 ) were 10 times lower than in the Our results supported this trend at a genomic scale. Taken together, Arabidopsis lineage (Fig. 2). Thus, gain rates tend to increase as the the results of our study reveal that lineage-specific duplicated genes evolutionary period gets younger. One explanation for this gain rate have contributed to species-specific evolution in Arabidopsis. difference is that duplicated genes tend to rapidly decay over time. This explanation is supported by a higher rate of pseudogenization in recently duplicated genes in comparison with non-duplicated 4. Availability genes (Table 1). Also, several previous reports showed that younger Illumina DNA-sequencing data (DRA004564) have been deposited in duplicated genes tended to be relaxed compared with older dupli- the DDBJ Sequence Read Archive (https://trace.ddbj.nig.ac.jp/dra/). 54,55,64–67 cated genes. That is, most anciently duplicated genes tend Contig sequences (BFAE01000001-BFAE01344622) assembled from not to be retained in current species. Consequently, the gain rates in- the Illumina DNA-sequencing data have been deposited in the DDBJ ferred in earlier evolutional periods tend to decrease. Mass Submission System. A. halleri gene sequences determined in this To investigate the functional divergence of duplicated genes in the study are included in Supplementary Tables S3 and S4. A. halleri-lyrata and A. halleri lineages, we identified OGGs under ei- ther positive or purifying selection in these lineages based on the ra- tio of nonsynonymous and synonymous substitution rates (K /K ). A S Acknowledgements Interestingly, the proportions of positive and purifying selection tended to increase and decrease, respectively, when gene duplication We thank Kiyomi Imamura, Makiko Tosaka, Taiji Kikuchi, Terumi Horiuchi, occurred in either the A. halleri-lyrata or A. halleri lineage. This Kanako Onizuka and Miu Kubota for Illumina DNA-sequencing analyses and Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 12 Functional divergence after gene duplication ddPCR analyses. We also thank the National Institute of Genetics of the 17. Hanada, K., Kuromori, T., Myouga, F., Toyoda, T., Li, W. H. and Research Organization of Information and Systems for providing excellent su- Shinozaki, K. 2009, Evolutionary persistence of functional compensation percomputer services. This work was supported by the Research for by duplicate genes in Arabidopsis, Genome Biol. Evol., 1, 409–14. Evolutional Science and Technology (CREST) programme ‘Creation of essen- 18. Hanada, K., Sawada, Y., Kuromori, T., et al. 2011, Functional compensa- tial technologies to utilize carbon dioxide as a resource through the enhance- tion of primary and secondary metabolites by duplicate genes in ment of plant productivity and the exploitation of plant products’ of the Japan Arabidopsis thaliana, Mol. Biol. Evol., 28, 377–82. Science and Technology Agency (JST) (JPMJCR11B3 to K.H. and S.I.M.); 19. Nowak, M. A., Boerlijst, M. C., Cooke, J. and Smith, J. M. 1997, Grants-in-Aid for Scientific Research (15K14421, 15H02433, K.H. and Evolution of genetic redundancy, Nature, 388, 167–71. S.I.M.) and research grants from the Takeda Science Foundation, the 20. Hanada, K., Kuromori, T., Myouga, F., Toyoda, T. and Shinozaki, K. Sumitomo Foundation, the Kurume Research Park and the Asahi Glass 2009, Increased expression and protein divergence in duplicate genes is as- Foundation (to K.H.). sociated with morphological diversification, PLoS Genet., 5, e1000781. 21. Alvarez-Ponce, D. and Fares, M. A. 2012, Evolutionary rate and duplic- ability in the Arabidopsis thaliana protein-protein interaction network, Conflict of interest Genome Biol. Evol., 4, 1263–74. 22. Warren, A. S., Anandakrishnan, R. and Zhang, L. 2010, Functional bias in None declared. molecular evolution rate of Arabidopsis thaliana, BMC Evol. Biol., 10,125. 23. Wang, J., Marowsky, N. C. and Fan, C. 2013, Divergent evolutionary and expression patterns between lineage specific new duplicate genes and Supplementary data their parental paralogs in Arabidopsis thaliana, PLoS One, 8, e72362. Supplementary data are available at DNARES online. 24. Yang, L. and Gaut, B. S. 2011, Factors that contribute to variation in evo- lutionary rate among Arabidopsis genes, Mol. Biol. Evol., 28, 2359–69. 25. Wendel, J. F., Jackson, S. A., Meyers, B. C. and Wing, R. A. 2016, References Evolution of plant genome architecture, Genome Biol. , 17, 37. 26. DeBarry, J. D. and Kissinger, J. C. 2014, A survey of innovation through 1. Lockton, S. and Gaut, B. S. 2005, Plant conserved non-coding sequences duplication in the reduced genomes of twelve parasites, PLoS One, 9, and paralogue evolution, Trends Genet., 21, 60–5. e99213. 2. Vanneste, K., Baele, G., Maere, S. and Van de Peer, Y. 2014, Analysis of 27. Sin, K., Street, N., Lundeberg, J. and Arvestad, L. 2012, Improved gap 41 plant genomes supports a wave of successful genome duplications in size estimation for scaffolding algorithms, Bioinformatics, 28, 2215–22. association with the Cretaceous-Paleogene boundary, Genome Res., 24, 28. Arabidopsis_Genome_Initiative 2000, Analysis of the genome sequence of 1334–47. the flowering plant Arabidopsis thaliana, Nature, 408, 796–815. 3. Hanada, K., Vallejo, V., Nobuta, K., et al. 2009, The functional role of 29. Hu, T. T., Pattyn, P., Bakker, E. G., et al. 2011, The Arabidopsis lyrata pack-MULEs in rice inferred from purifying selection and expression pro- genome sequence and the basis of rapid genome size change, Nature file, Plant Cell, 21, 25–38. Genetics, 43, 476–81. 4. Hanada, K., Zou, C., Lehti-Shiu, M. D., Shinozaki, K. and Shiu, S. H. 30. Beilstein, M. A., Nagalingum, N. S., Clements, M. D., Manchester, S. R. 2008, Importance of lineage-specific expansion of plant tandem duplicates and Mathews, S. 2010, Dated molecular phylogenies indicate a Miocene in the adaptive response to environmental stimuli, Plant Physiol., 148, origin for Arabidopsis thaliana, Proc. Natl. Acad. Sci. U. S. A., 107, 993–1003. 18724–8. 5. Fortna, A., Kim, Y., MacLaren, E., et al. 2004, Lineage-specific gene du- 31. Moghe, G. D., Hufnagel, D. E., Tang, H., et al. 2014, Consequences of plication and loss in human and great ape evolution, PLoS Biol., 2, E207. whole-genome triplication as revealed by comparative genomic analyses 6. Rostoks, N., Borevitz, J. O., Hedley, P. E., et al. 2005, Single-feature poly- of the wild radish raphanus raphanistrum and three other Brassicaceae morphism discovery in the barley transcriptome, Genome Biol., 6,R54. species, Plant Cell, 26, 1925–37. 7. Clark, R. M., Schweikert, G., Toomajian, C., et al. 2007, Common se- 32. Koenig, D. and Weigel, D. 2015, Beyond the thale: comparative genomics quence polymorphisms shaping genetic diversity in Arabidopsis thaliana, and genetics of Arabidopsis relatives, Nat. Rev. Genet., 16, 285–98. Science, 317, 338–42. 33. Kramer, U. 2010, Metal hyperaccumulation in plants, Ann. Rev. Plant 8. Rizzon, C., Ponger, L. and Gaut, B. S. 2006, Striking similarities in the ge- Biol., 61, 517–34. nomic distribution of tandemly arrayed genes in Arabidopsis and rice, 34. Verbruggen, N., Hermans, C. and Schat, H. 2009, Molecular mechanisms PLoS Comput Biol., 2, e115. of metal hyperaccumulation in plants, New Phytol., 181, 759–76. 9. Rodgers-Melnick, E., Mane, S. P., Dharmawardhana, P., et al. 2012, 35. Shimizu, K. K. and Purugganan, M. D. 2005, Evolutionary and ecological Contrasting patterns of evolution following whole genome versus tandem genomics of Arabidopsis, Plant Physiol., 138, 578–84. duplication events in Populus, Genome Res., 22, 95–105. 36. Turner, T. L., Bourne, E. C., Von Wettberg, E. J., Hu, T. T. and Nuzhdin, 10. Leister, D. 2004, Tandem and segmental gene duplication and recombina- S. V. 2010, Population resequencing reveals local adaptation of tion in the evolution of plant disease resistance gene, Trends Genet., 20, Arabidopsis lyrata to serpentine soils, Nat. Genet., 42, 260–3. 116–22. 37. Novikova, P. Y., Hohmann, N., Nizhynska, V., et al. 2016, Sequencing of 11. Ohno, S. 1970, Evolution by Gene Duplication. Springer-Verlag: New the genus Arabidopsis identifies a complex history of nonbifurcating spe- York. ciation and abundant trans-specific polymorphism, Nat. Genet, 48, 12. Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L. and 1077–82. Postlethwait, J. 1999, Preservation of duplicate genes by complementary, 38. Briskine, R. V., Paape, T., Shimizu-Inatsugi, R., et al. 2017, Genome as- degenerative mutations, Genetics, 151, 1531–45. sembly and annotation of Arabidopsis halleri, a model for heavy metal 13. Zou, C., Lehti-Shiu, M. D., Thibaud-Nissen, F., Prakash, T., Buell, C. R. hyperaccumulation and evolutionary ecology, Mol. Ecol. Resour., 17, and Shiu, S. H. 2009, Evolutionary and expression signatures of pseudo- 1025–36. genes in Arabidopsis and rice, Plant Physiol., 151, 3–15. 39. Akama, S., Shimizu-Inatsugi, R., Shimizu, K. K. and Sese, J. 2014, 14. Zheng, D. and Gerstein, M. B. 2007, The ambiguous boundary between Genome-wide quantification of homeolog expression ratio revealed non- genes and pseudogenes: the dead rise up, or do they? Trends Genet., 23, stochastic gene regulation in synthetic allopolyploid Arabidopsis, Nucleic 219–24. Acids Res., 42, e46. 15. Conant, G. C. and Wolfe, K. H. 2008, Turning a hobby into a job: how 40. Sato, Y. and Kudoh, H. 2014, Fine-scale genetic differentiation of a tem- duplicated genes find new functions, Nat. Rev. Genet., 9, 938–50. perate herb: relevance of local environments and demographic change, 16. Kondrashov, F. A. 2012, Gene duplication as a mechanism of genomic ad- AoB Plants, 6, plu070. aptation to a changing environment, Proc. Biol. Sci., 279, 5048–57. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018 K. Hanada et al. 13 41. Hanikenne, M., Talke, I. N., Haydon, M. J., et al. 2008, Evolution of co-segregate with zinc tolerance and account for high MTP1 transcript metal hyperaccumulation required cis-regulatory changes and triplication levels, Plant J., 39, 425–39. of HMA4, Nature, 453, 391–5. 59. Duarte, J. M., Wall, P. K., Edger, P. P., et al. 2010, Identification of shared 42. Al-Shehbaz, I. A. and O’Kane, S. L., Jr 2002, Taxonomy and phylogeny single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and of Arabidopsis (Brassicaceae). In The Arabidopsis Book/American Society their phylogenetic utility across various taxonomic levels, BMC Evol. of Plant Biologists, Vol. 1. Biol., 10, 61. 43. Boratyn, G. M., Camacho, C., Cooper, P. S., et al. 2013, BLAST: a more 60. Wilgenbusch, J. C. and Swofford, D. 2003, Inferring evolutionary trees efficient report with usability improvements, Nucleic Acids Res., 41, with PAUP*, Curr. Protoc. Bioinformatics, Chapter 6, Unit 6 4. W29–33. 61. Saitou, N. and Nei, M. 1987, The neighbor-joining method: a new 44. Johnston, J. S., Pepper, A. E., Hall, A. E., et al. 2005, Evolution of genome method for reconstructing phylogenetic trees, Mol. Biol. Evol., 4, 406–25. size in Brassicaceae, Ann. Bot., 95, 229–35. 62. Willems, G., Drager, D. B., Courbot, M., Gode, C., Verbruggen, N. and 45. Kajitani, R., Toshimoto, K., Noguchi, H., et al. 2014, Efficient de novo as- Saumitou-Laprade, P. 2007, The genetic basis of zinc tolerance in the met- sembly of highly heterozygous genomes from whole-genome shotgun allophyte Arabidopsis halleri ssp. halleri (Brassicaceae): an analysis of short reads, Genome Res., 24, 1384–95. quantitative trait loci, Genetics, 176, 659–74. 46. Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and 63. Frerot, H., Faucon, M. P., Willems, G., et al. 2010, Genetic architecture Birol, I. 2009, ABySS: a parallel assembler for short read sequence data, of zinc hyperaccumulation in Arabidopsis halleri: the essential role of Genome Res., 19, 1117–23. QTL x environment interactions, New Phytol., 187, 355–67. 47. Luo, R., Liu, B., Xie, Y., et al. 2012, SOAPdenovo2: an empirically im- 64. Vishnoi, A., Kryazhimskiy, S., Bazykin, G. A., Hannenhalli, S. and proved memory-efficient short-read de novo assembler, Gigasci., 1, 18. Plotkin, J. B. 2010, Young proteins experience more variable selection 48. Wu, T. D. and Watanabe, C. K. 2005, GMAP: a genomic mapping and pressures than old proteins, Genome Res., 20, 1574–81. alignment program for mRNA and EST sequences, Bioinformatics, 21, 65. Jordan, I. K., Wolf, Y. I. and Koonin, E. V. 2004, Duplicated genes evolve 1859–75. slower than singletons despite the initial rate increase, BMC Evol. Biol., 4, 22. 49. Langdon, W. B. 2015, Performance of genetic programming optimised 66. Alba, M. M. and Castresana, J. 2005, Inverse relationship between evolu- Bowtie2 on genome comparison and analytic testing (GCAT) bench- tionary rate and age of mammalian genes, Mol. Biol. Evol., 22, 598–606. marks, BioData Min., 8,1. 67. Wolf, Y. I., Novichkov, P. S., Karev, G. P., Koonin, E. V. and Lipman, D. 50. Untergasser, A., Nijveen, H., Rao, X., Bisseling, T., Geurts, R. and J. 2009, The universal distribution of evolutionary rates of genes and dis- Leunissen, J. A. 2007, Primer3Plus, an enhanced web interface to tinct characteristics of eukaryotic genes of different apparent ages, Proc. Primer3, Nucleic Acids Res., 35, W71–4. Natl. Acad. Sci. U. S. A., 106, 7273–80. 51. Remm, M., Storm, C. E. and Sonnhammer, E. L. 2001, Automatic cluster- 68. Davis, J. C. and Petrov, D. A. 2004, Preferential duplication of conserved ing of orthologs and in-paralogs from pairwise species comparisons, proteins in eukaryotic genomes, PLoS Biol., 2, E55. J. Mol. Biol., 314, 1041–52. 69. Yang, J., Gu, Z. and Li, W. H. 2003, Rate of protein evolution versus fit- 52. Katoh, K., Misawa, K., Kuma, K. and Miyata, T. 2002, MAFFT: a novel ness effect of gene deletion, Mol. Biol. Evol., 20, 772–4. method for rapid multiple sequence alignment based on fast Fourier trans- 70. Satake, M., Kawata, M., McLysaght, A. and Makino, T. 2012, Evolution form, Nucleic Acids Res., 30, 3059–66. of vertebrate tissues driven by differential modes of gene duplication, 53. Yang, Z. 2007, PAML 4: phylogenetic analysis by maximum likelihood, DNA Res., 19, 305–16. Mol. Biol. Evol., 24, 1586–91. 71. Lan, X. and Pritchard, J. K. 2016, Coregulation of tandem duplicate genes 54. Lynch, M. and Conery, J. S. 2000, The evolutionary fate and conse- slows evolution of subfunctionalization in mammals, Science, 352, 1009–13. quences of duplicate genes, Science, 290, 1151–5. 72. Rensing, S. A., Lang, D., Zimmer, A. D., et al. 2008, The Physcomitrella 55. Bu, L. and Katju, V. 2015, Early evolutionary history and genomic fea- genome reveals evolutionary insights into the conquest of land by plants, tures of gene duplicates in the human genome, BMC Genomics, 16, 621. Science, 319, 64–9. 56. Gao, L. Z. and Innan, H. 2004, Very low gene duplication rate in the 73. Heckman, T. M. and Kauffmann, G. 2011, The coevolution of galaxies yeast genome, Science, 306, 1367–70. and supermassive black holes: a local perspective, Science, 333, 182–5. 57. Heredia, N. J., Belgrader, P., Wang, S., et al. 2013, Droplet Digital PCR 74. Tuskan, G. A., Difazio, S., Jansson, S., et al. 2006, The genome of black cot- quantitation of HER2 expression in FFPE breast cancer samples, tonwood, Populus trichocarpa (Torr. & Gray), Science, 313, 1596–604. Methods, 59, S20–3. 75. Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. and Li, W. H. 1989, 58. Drager, D. B., Desbrosses-Fonrouge, A. G., Krach, C., et al. 2004, Two Date of the monocot-dicot divergence estimated from chloroplast DNA genes encoding Arabidopsis halleri MTP1 metal transport proteins sequence data, Proc. Natl. Acad. Sci. U. S. A., 86, 6201–5. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy005/4898128 by Ed 'DeepDyve' Gillespie user on 08 June 2018

Journal

DNA ResearchOxford University Press

Published: Feb 21, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off