TodoFirGene: Developing Transcriptome Resources for Genetic Analysis of Abies sachalinensis

TodoFirGene: Developing Transcriptome Resources for Genetic Analysis of Abies sachalinensis Abstract Todo-matsu (Abies sachalinensis) is one of the most important forestry species in Hokkaido, Japan and is distributed from near sea level to the alpine zone. Due to its wide spatial distribution, the species adapts to its environment, displaying phenotypes of ecological relevance. In order to identify candidate genes under natural selection, we collected the transcriptome from the female and male flower, leaf and inner bark. De novo assembly with 34.7 Gb of sequencing reads produced 158,542 transcripts from 69,618 loci, whose estimated coverage reached 95.6% of conserved eukaryotic genes. Homology searches against publicly available databases identified 134,190 (84.6%) transcripts with at least one hit. In total, 28,944 simple sequence repeats (SSRs) and 80,758 single nucleotide variants (SNVs) were detected from 23,570 (14.9%) and 25,366 (16.0%) transcripts, which were valuable for use in genetic analysis of the species. All the annotations were included in a relational database, TodoFirGene, which provides an interface for various queries and homology search, and can be accessed at http://plantomics.mind.meiji.ac.jp/todomatsu/. This database hosts not only the A. sachalinensis transcriptome but also links to the proteomes of 13 other species, allowing a comparative genomic study of plant species. Introduction Abies sachalinensis (known as todo-matsu, todo-fir or Sakhalin fir) is a coniferous, wind-pollinated and wind-dispersed species in the Pinaceae family (2n = 24) (Yamazaki 1995). This conifer plays an important role in timber use and forms an important component of forest ecosystems as a dominant climax species of the sub-boreal zone in northern Japan (Kato 1952). The geographic range of this species mainly extends from Hokkaido, the northern island of Japan, to Sakhalin Island. The elevational distribution extends from near sea level to the alpine zone, approximately 1,600 m a.s.l. (Yamazaki 1995). Due to this wide ecological distribution, large phenotypic variations along the environmental gradient were well known, suggesting adaptation to the habitat (Eiga and Sakai 1984, Eiga and Sakai 1987). A reciprocal transplant experiment along the elevational gradient showed that both upslope and downslope transplantation reduced productivity, indicating the home-site advantage of the local population (Ishizuka and Goto 2012). Despite being planted under favorable environmental conditions on a downslope site, plants derived from the upslope population showed inferior growth to those from other sites. Intraspecific adaptation to the local climate is thought to be driven by genome-based control of the responsible traits, such as autumn phenology (Ishizuka et al. 2015). However, the presence of relevant genes and/or their mutations and their potential for local adaptation have not been sufficiently examined for this species. In order to evaluate the genes responsible and their ecological relevance, gene database/catalogs must be an appropriate tool for primary screening for candidate genes, which is, however, lacking for this species and its genus Abies. Transcriptome sequencing or RNA-seq (RNA sequencing) is a method to capture whole expressed genes in a specific tissue. The resulting vast amount of data can be utilized for the analysis of digital gene expression as well as for the construction of transcriptome contigs or transcripts in non-model organisms for which no reference sequences are available. Development of a reference gene set is a cost-effective strategy of first choice for non-model organisms, enabling a detailed downstream genetic study by providing an atlas of the expressed genes. Although sequence information from related species such as A. alba (Roschanski et al. 2013) and A. balsamea (Zerbe et al. 2012) could be transferred, detailed genetic analysis at species level needs reference sequences from the target species. In the present study, our objectives were to develop transcriptome resources for A. sachalinensis, for which no genome resources are available to date. The resulting transcript sequences were annotated to construct a database, TodoFirGene, which will be used to identify candidate genes of ecological relevance in Abies species in future studies. Furthermore, orthologous sequences in 14 species including A. sachalinensis were identified for comparative studies, which reinforce the omics information of gymnosperms and connect researchers from forest sciences with those in comparative bioinformatics and evolutionary sciences. Results and Discussion Construction of the reference transcriptome In total we collected 36.5 Gb of sequencing reads from four tissues, ranging from 7.4 Gb from the inner bark to 10.9 Gb from the female flower (Table 1). After pre-processing, 34.7 Gb (95%) of reads were retained and used for assembly. The assembler, Oases (Schulz et al. 2012), with k-mers from 21 to 95 produced 2,550,215 contigs in total, ranging from 28,383 (k = 95) to 357,191 (k = 21) (Supplementary Table S1), which EvidentialGene (Gilbert 2013) clustered into 158,542 transcripts with an estimated 69,618 loci. Each locus had a mean and maximum of 2.28 and 123 transcripts, respectively. The length of transcripts varied from 180 to 19,482 bp, with an average of 1,560 bp (Fig. 1), which we defined as the ‘Public set’ of the reference transcriptome for A. sachalinensis (TodoFirGene) (Table 2). Table 1 Transcriptome sequencing of A. sachalinensis by HiSeq2000 Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Table 1 Transcriptome sequencing of A. sachalinensis by HiSeq2000 Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Table 2 Assembly statistics of the A. sachalinensis transcriptome Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Table 2 Assembly statistics of the A. sachalinensis transcriptome Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Fig. 1 View largeDownload slide Histogram of transcript length for A. sachalinensis. Fig. 1 View largeDownload slide Histogram of transcript length for A. sachalinensis. Evaluation of the reference transcriptome When the comprehensiveness of TodoFirGene was evaluated by CEGMA (core eukaryotic gene mapping approach) (Parra et al. 2007), 237 (95.6%) of 248 core eukaryotic genes (CEGs) were found. As for chimera detection, the method of Yang and Smith (YS method) with optimize_assembler (Yang and Smith 2013) identified 1,212 (0.76%) transcripts as chimeric, while FrameDP (Gouzy et al. 2009) identified 152,715 protein sequences from 158,542 transcripts (TodoFirGene), resulting in 14,418 (9.1%) of the reference with multiple protein sequences estimated. In all, 748 (0.47%) transcripts were identified by both YS and FrameDP methods, and ‘High’ levels of chimeric risk were suggested, while 14,134 (8.9%) transcripts were possible chimeras identified by either of the two methods and labeled as ‘Middle’ chimeric risk. In our preliminary assembly and clustering of the transcriptome of A. sachalinensis, we used Oases with k-mers 31, 41, 51, 61 and 71. After splitting contig sequences at ‘N’ sites, sequences were then merged with the Oases ‘-merge’ option and clustered with CD-HIT-EST (Li and Godzik 2006), which produced 223,621 transcripts with an estimated 71,707 loci. This portion of the transcriptome covered as much as 99.6% of 248 CEGs; however, 6,631 (4.5%) transcripts were estimated as possible chimeras by the YS method, showing that the rate of chimeric sequences (4.5%) was much higher than that (0.90%) of the ‘Public set’ (Table 3). The ‘-merge’ option by Oases may contribute to a rise in the rate of chimeric transcripts (Yang and Smith 2013). In contrast, when we selected ‘OK main’ only and excluded ‘alt’ categories from the output of EvidentialGene, the resulting set contained 53,798 transcripts from 33,406 loci. Unfortunately, this smallest and compact set of the transcriptome covers 91.9% of 248 CEGs, and 543 (1.35%) transcripts are identified as possible chimeras by the YS method. We therefore considered the ‘Public set’ suitable for defining the reference transcriptome of A. sachalinensis (TodoFirGene). Table 3 Evaluation of transcriptome assembly for A. sachalinensis, with three sets of transcriptome assembly considered Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Table 3 Evaluation of transcriptome assembly for A. sachalinensis, with three sets of transcriptome assembly considered Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Annotation of TodoFirGene Blast searches for the TodoFirGene transcriptome identified a minimum of 98,605 (62.2%) and a maximum of 124,381 (78.4%) hits against UniProt (Swiss-Prot) and Picea abies protein database sequences, respectively (Table 4), totaling 136,290 (86.0%) transcripts with at least one hit. When taxonomy information in the NR (non-redundant) database entry was considered, we found that 3,619 (2.3%) transcripts may come from organisms other than A. sachalinensis (Supplementary Table S2), which may co-occur with the species. InterProScan added 1,768,327 annotations for 111,395 (70.3%) putative proteome estimated by EvidentialGene (Table 4). In total, 1,852 Gene Ontology (GO) and 85 GOslim terms were extracted, which could be distributed among 83,969 (53.0%) sequences. According to the KEGG Automatic Annotation Server (KAAS), 29,751 (18.8%) of transcripts were mapped to 3,314 Kyoto Encyclopedia of Genes and Genomes (KEGG) entries, with disease resistance protein RPS2 (K13459) the most frequent, of which 1,507 (0.95%) of transcripts were included. The average number of sequences within a single KEGG ortholog group was 23.1. Analysis of intracellular localization with the SignalP, ChloroP and WoLF PSORT programs identified, respectively, 13,041 (8.2%), 20,383 (12.9%) and 158,500 (100.0%) putative reference proteins with intracellular localization signals. Table 4 Annotation statistics of TodoFirGene Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  Table 4 Annotation statistics of TodoFirGene Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  In ortholog analysis with a 14-plant proteome set (Table 5), 212,540 orthogroups were identified, 59,553 (28.0%) of which contained at least one A. sachalinensis sequence, while none of the sequences was related to the remaining 152,987 (72.0%) orthogroups (Fig. 2). When we focused on orthogroups with A. sachalinensis sequences, those with only A. sachalinensis (unique orthogroup) were the most common [46,280 (21.8%) of the total orthogroup], followed by orthogroups with all 14 species [5,592 (2.6%) common orthogroups]. This may result from a unique character of A. sachalinensis and a universal feature of the plant transcriptome. The unique orthogroup contained 47,949 transcripts from TodoFirGene, 32,899 (68.6%) of which had no hits against the NR database, while the common orthogroup contained 53,205 transcripts, only 581 (1.1%) of which had no hits against the NR database. Table 5 Data sources used for ortholog detection by orthofinder Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Table 5 Data sources used for ortholog detection by orthofinder Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Fig. 2 View largeDownload slide Ortholog analysis with 14 proteome sets. The x-axis indicates the number of species in an orthogroup, and the y-axis displays the number of genes (transcripts) included in the corresponding cluster category. The abbreviations are as follows: As, Abies sachalinensis; At, Arabidopsis thaliana; Gm, Glycine max; Mt, Medicago truncatula; Nt, Nicotiana tabacum; Os, Oryza sativa; Pa, Picea abies; Pp, Physcomitrella patens; Pt, Populus trichocarpa; Sb, Sorghum bicolor; Sl, Solanum lycopercicum; St, Solanum tuberosum; and Vv, Vitis vinifera. Fig. 2 View largeDownload slide Ortholog analysis with 14 proteome sets. The x-axis indicates the number of species in an orthogroup, and the y-axis displays the number of genes (transcripts) included in the corresponding cluster category. The abbreviations are as follows: As, Abies sachalinensis; At, Arabidopsis thaliana; Gm, Glycine max; Mt, Medicago truncatula; Nt, Nicotiana tabacum; Os, Oryza sativa; Pa, Picea abies; Pp, Physcomitrella patens; Pt, Populus trichocarpa; Sb, Sorghum bicolor; Sl, Solanum lycopercicum; St, Solanum tuberosum; and Vv, Vitis vinifera. Digital expression analysis with RSEM (RNA-Seq by expectation-maximization) identified candidate transcripts that are expressed specifically in each tissue (Fig. 3). Inner bark was the most characteristic tissue in terms of gene expression, because 10,373 transcripts or 4,793 genes were uniquely expressed in this tissue. In contrast, leaf tissue contained only 3,773 and 770 leaf-specific transcripts and genes, respectively. Fig. 3 View largeDownload slide Venn diagram for read distributions for each tissue which were mapped onto TodoFirGene. The figures beneath the tissue name represent the number of transcripts that have FPKM (fragments per kilobase of transcript per million mapped reads) >0. Levels of expression were quantified for each transcript (a) and gene (b). Fig. 3 View largeDownload slide Venn diagram for read distributions for each tissue which were mapped onto TodoFirGene. The figures beneath the tissue name represent the number of transcripts that have FPKM (fragments per kilobase of transcript per million mapped reads) >0. Levels of expression were quantified for each transcript (a) and gene (b). As for repeat sequences, RepeatMasker masked 1,452,758 bp, which amounted to 0.59% of the total transcript sequences (Supplementary Table S3) and was distributed among 24,050 (15.2%) transcripts. Among the repeat sequences, simple sequence repeats (SSRs) were the most frequent, and RepeatMasker identified simple sequences or low-complexity sequences in 23,467 (14.8%) transcripts. The MISA program also discovered 28,944 SSRs in 23,570 (14.9%) of transcripts. The most frequent repeats were mono-SSRs, followed by di-SSRs (Fig. 4). In the coding region, the frequency of tri- and hexanucleotide repeats was high, where gain or loss of other nucleotide repeats caused frameshift mutations, resulting in protein malfunction. When we exclude mononucleotide repeats, 17,008 repeats were found in 13,780 (8.7%) transcripts. The most frequent di-SSR and tri-SSR motifs were AT and AGG, respectively (Supplementary Fig. S1). Fig. 4 View largeDownload slide Frequency distribution for each microsatellite motif. Fig. 4 View largeDownload slide Frequency distribution for each microsatellite motif. Candidate single nucleotide variants (SNVs) were detected from 25,366 (16.0%) transcripts totaling 80,758 SNVs (3.2 SNVs/transcript on average). The ratio of transition to transversion mutations was 1.51 (Supplementary Table S4). Transcriptome database: TodoFirGene We have constructed a transcriptome database, TodoFirGene (Fig. 5), where most of the results of our analysis were loaded and are accessible through the Internet under the following URL: http://plantomics.mind.meiji.ac.jp/todomatsu/. The database can be queried by either keywords, sequences or expression levels from the front page (Fig. 5). On the query results page, links to detailed information for transcripts/loci are presented. The transcript detail page (Fig. 6) displays the transcript and annotation summary, followed by each annotation detail for KEGG orthology and pathway, InterProScan (including GO) and BLAST search results against NR and UniProt (Swiss-Prot) databases, protein sequences from spruce (P. abies) and poplar (Populus trichocarpa), and transcriptome sequences from silver fir (A. alba) databases. Further annotations were added for intracellular localization and orthologous groups of predicted proteins. For each sequence in orthogroups, the following links to the PODC (Plant Omics Data Center) will display the levels of gene expression in a specified species. The detailed expression for four tissues in A. sachalinensis is shown as a bar graph on the same transcript detail page as well as in JBrowse display, which includes unique read mapping (multiply mapped reads are not shown), and annotations such as protein translations, BLAST results, repeat sequences (microsatellites) and SNVs. Other information includes the chimera risk of the transcript, the nucleotide sequence and predicted protein sequence. For loci with multiple transcripts, the locus detail page (Fig. 7) is convenient, where superTranscript is constructed to which multiple transcripts are visually aligned. Locus and transcript level expression are also shown in a bar graph. The TodoFirGene website has a bulk download page (http://plantomics.mind.meiji.ac.jp/todomatsu/download.html), where transcript and coding sequences and predicted protein sequences are available. Pairwise comparisons of gene expression among four tissues (male flower, female flower, young flushing leaf and inner bark) in terms of FPKM (fragments per kilobase of transcript per million mapped reads) are shown in the Gene expression atlas page (http://plantomics.mind.meiji.ac.jp/todomatsu/exp_atlas.html), where the values of FPKM for transcript and locus level can be downloaded (Supplementary Fig. S2). Fig. 5 View largeDownload slide TodoFirGene database screen shot displaying the query page and response for Gene (key words) search, BLAST and Expression value search. Fig. 5 View largeDownload slide TodoFirGene database screen shot displaying the query page and response for Gene (key words) search, BLAST and Expression value search. Fig. 6 View largeDownload slide Browsing the Transcript detail page of TodoFirGene with the navigation on the left and the body on the right. (A) Summary for a transcript and annotation. (B) KEGG orthology and pathway. (C) InterProScan including protein families, domains and other signatures. (D) Gene Ontology and GO slim terms for plants. (E) BLAST results against NR, UniProt (Swiss-Prot), spruce and poplar protein databases. (F) Intracellular localization predicted by ChloroP, SignalP and WoLF PSORT. (G) Orthologous groups predicted by orthofinder. (H) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units. (I) Assembly details including chimera risks and repeat sequences predicted by RepeatMasker. (J) Transcript and protein sequences. Fig. 6 View largeDownload slide Browsing the Transcript detail page of TodoFirGene with the navigation on the left and the body on the right. (A) Summary for a transcript and annotation. (B) KEGG orthology and pathway. (C) InterProScan including protein families, domains and other signatures. (D) Gene Ontology and GO slim terms for plants. (E) BLAST results against NR, UniProt (Swiss-Prot), spruce and poplar protein databases. (F) Intracellular localization predicted by ChloroP, SignalP and WoLF PSORT. (G) Orthologous groups predicted by orthofinder. (H) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units. (I) Assembly details including chimera risks and repeat sequences predicted by RepeatMasker. (J) Transcript and protein sequences. Fig. 7 View largeDownload slide Browsing the Locus detail page of TodoFirGene with navigation on the left and the body on the right. (A) Summary of transcripts at the same locus. (B) Isoforms aligned with superTranscript constructed by Lace. (C) Summary of annotation with InterPro, Gene Ontology and BLAST for each transcript. (D) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units for a locus and for all isoforms. Fig. 7 View largeDownload slide Browsing the Locus detail page of TodoFirGene with navigation on the left and the body on the right. (A) Summary of transcripts at the same locus. (B) Isoforms aligned with superTranscript constructed by Lace. (C) Summary of annotation with InterPro, Gene Ontology and BLAST for each transcript. (D) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units for a locus and for all isoforms. Utility of TodoFirGene Exploring candidate genes for flowering Predicting flowering is important for industrial seed production, not only of A. sachalinensis but also of other tree species which flower sporadically. Variable age of maturity with flowering is observed for A. sachalinensis depending on environmental conditions and the origin of individuals (Kurahashi and Iguchi 1993). Abies sachalinensis from higher altitudes reach maturity and flower earlier than those from lower altitudes (Kurahashi et al. 1993). Artificial pollination experiments between high- and low-altitude genotypes revealed that these reproductive traits must be genetically controlled (Hisamoto and Goto 2017). Flowering is induced by plant hormones, one of which is florigen encoded by Flowering Locus T (FT). In Arabidopsis thaliana, FT is expressed and translated in leaf tissue and transported to the apical meristem, where floral identity genes are activated (Corbesier et al. 2007). To find FT genes in A. sachalinensis, go to Gene Search and enter the keyword ‘Flowering Locus T’. This query returns four entries (four transcripts from three loci), the sequence and annotations of which are available from the download button (Fig. 5). From the Ortholog group column, one locus (AbisacEGm027477) may be specific to A. sachalinensis, while two loci (AbisacEGm026355 and AbisacEGm027876) have common orthologs in all 14 species. A BLAST search is also used to find FT, using, for example, A. thaliana FT available from: https://www.ncbi.nlm.nih.gov/nuccore/AB027504.1? report=fasta (March 27, 2018, date last accessed). Pasting the FASTA-formatted sequence to the ‘Query fasta’ box, selecting the appropriate BLAST program and parameters (blastx and e-value of 1E-50 in the current case) and clicking the ‘Submit’ button will execute BLAST searches, and results are displayed (Fig. 5) with four transcripts from four loci. Pollen is a specific structure in the male flower, and we expect characteristic transcripts from the tissue. To find transcripts with expression values >100 times greater in the male flower compared with the female flower, go to Expression value search and enter, for example, 100 in the fold change box. The result shows 20 genes with the greatest expression difference of 1,715.18 from AbisacEGm065621 (Fig. 5). Clicking the link to the locus detail page for AbisacEGm065621 will show annotations suggesting that the gene may function as chitinase and have nine transcripts. Chitinase has been identified as a pollen allergen in Cryptomeria japonica (Cupressaceae) (Fujimura et al. 2005), a major forestry conifer in Japan. Detecting Pinaceae orthologous genes Although OrthoFinder identified that 5,592 orthogroups are common in 14 species, we focus on orthologs among conifers, especially Pinaceae, where conservation of synteny is widely observed among species in different genera. TodoFirGene reference transcripts were utilized to search orthologs of Pinaceae genes (de Miguel et al. 2015) and we found 2,227 homologous sequences in Pinus pinaster, which has been on the Pinaceae composite map. Among these, 955 had homologous sequences from Picea species, which we assumed to be candidate orthologous genes in A. sachalinensis among Pinaceae (Supplementary Table S5). From congeneric sequences in Abies, 387 A. alba genes were identified as A. sachalinensis orthologous and also as Pinaceae orthologous candidates (Supplementary Table S5). The smaller number of candidate orthologous genes identified in A. alba may come from incomplete coverage of the transcriptome [22,561 transcriptome shotgun assembly (TSA) sequences with an average length of 538 bp]. Orthologous genes will be further confirmed after we carry out linkage mapping of these candidates. de Miguel et al. (2015) identified 513 orthologous unigenes between Pinus and Picea with BLAST homology search and linkage map information. For A. sachalinensis, we have already constructed linkage maps using RAD-Seq (restriction site-associated DNA sequencing)-based markers, and identified several quantitative trait loci (QTLs) (Goto et al. 2017). Genetic markers resulting from RAD-Seq normally have no relationships to gene functions. Also sequence similarity of RAD-Seq between species is expected to be low, suggesting low transferability of these markers to other related species. In contrast, the conservative nature of genic sequences helps to link syntenic regions between related species, especially within Pinaceae, whose level of synteny is relatively high. Linkage mapping with these candidate orthologous genes is now underway. Materials and Methods Plant materials and RNA-Seq A matured Abies sachalinensis individual was selected for the present study. The origin of this tree was the high-elevation zone of Mt. Dairoku (1459 m a.s.l.; 43°20'23''N, 142°37'56''E), while the growing environment was in the lowland. In 1973, open pollinated seeds were collected from a natural stand at 1,250 m, and were sown in a nursery seed bed of the University of Tokyo Hokkaido Forest (UTHF) the following year. In 1978, the seedlings were planted in the exhibition forest at 230 m in the UTHF (43°13'11''N, 142°23'03''E). When planted trees were 37 years old, a healthy and flowering individual was selected to use for the subsequent RNA experiment. RNA was separately extracted from (i) fresh male flower; (i) fresh female flower, (iii) young flushing leaf; and (iv) inner bark tissue, which were collected on a partly cloudy morning (10:00–11:00 h) on May 24, 2010. Approximately 0.5–1.0 g of each fresh tissue was used for RNA extraction. Total RNA of each tissue was extracted following the method described by Le Provost et al. (2007) and stored at –80°C until sequencing analysis. The RNA-Seq experiment was performed by Hokkaido System Science Inc. with 100 bp paired-end libraries run on a HiSeq2000 (Illumina Inc.). We used a separate barcode to distinguish each tissue sample on a single flow cell lane. Bioinformatic analysis Pre-processing of sequencing reads Redundant sequencing reads (reads with the same sequence in a pair) were removed with an in-house script, and then adaptors and poly(A) sequences were removed with cutadapt (Martin 2011). Low-quality sequences were also removed with the in-house program, where reads of length ≥20 bp were retained. Reads with similarities to rRNA/tRNA sequences in A. thaliana and Oryza sativa (downloaded from TAIR and RAP-DP, respectively) were removed according to the BLASTn results (e-value cut-off of 1E-5). If either forward or reverse reads were removed during this pre-process, they were both removed. Remaining reads were re-paired in paired-end form. Assembly, clustering and evaluation of reference transcriptome Pre-processed reads were assembled with Velvet (Zerbino and Birney 2008) and Oases (Schulz et al. 2012) with k-mers (21–95), and all of the resulting transcripts were clustered with the EvidentialGene pipeline (Gilbert 2013). We selected ‘OK main’ and ‘alt’ categories from the output of EvidentialGene and defined the ‘Public set’ of the reference transcriptome for A. sachalinensis (TodoFirGene). Multiple transcripts from the same gene were integrated into a superTranscript with Lace software (Davidson et al. 2017). In order to evaluate the reference transcriptome, we focused on comprehensiveness and the number of erroneous chimeras in the reference. Comprehensiveness was assessed with CEGMA (Parra et al. 2007), in which 248 CEGs were BLASTed against TodoFirGene. Chimeric candidates were detected by FrameDP (Gouzy et al. 2009) with Pabies1.0-all-pep.fa as a reference protein sequence, which was merged from Pabies1.01.0-HC-pep.faa, Pabies1.01.0-MC-pep.faa and Pabies1.01.0-LC-pep.faa on the web site of the Norway Spruce Genome Project (Nystedt et al. 2013). In addition, chimeric transcripts were also detected by a script in optimize_assembler (Yang and Smith 2013) (hereafter referred to as the YS method) with Pabies1.0-all-pep.fa as a reference transcript and an e-value cut-off of 1E-2. If FrameDP identified multiple amino acid sequences within a single transcript, and/or the YS method identified a transcript as chimeric, we assumed the transcripts to have a certain level of risk of chimera. We defined three risk levels of ‘High’, ‘Middle’ and ‘Low’, if both, one and no methods identified a transcript as chimeric, respectively. Annotation of reference transcriptome Reference transcriptome sequences (TodoFirGene) were BLASTed against NR and UniProt (Swiss-Prot). Protein sequences from P. abies (ver. 1.0) and P. trichocarpa (ver. 3.1) were also queried, which were downloaded from ConGenIE (http://congenie.org/start (March 27, 2018, date last accessed)) and phytozome12 (https://phytozome.jgi.doe.gov/pz/portal.html (March 27, 2018, date last accessed)), respectively. As congeneric sequences, A. alba 22,561 TSA sequences (Roschanski et al. 2013) were downloaded and queried by tBLASTx. We applied an e-value cut-off of 1E-5 for all BLAST searches. From the result of the NR BLAST search, taxonomy information on the top hit accession number from each query was extracted with the tax2acc tool (https://github.com/richardmleggett/acc2tax (March 27, 2018, date last accessed)). We used the KEGG database to estimate metabolic pathways in KAAS (Moriya et al. 2007) with eudicot and monocot gene sets. Estimated amino acid sequences by EvidentialGene were queried against the InterPro database, whose output was used to extract GO terms. GO terms were converted into plant GOslim terms by Map2Slim from OWLTools (https://github.com/owlcollab/owltools (March 27, 2018, date last accessed)). Intracellular localization for the amino acid sequences by EvidentialGene was predicted with SignalP (Petersen et al. 2011), ChloroP (Emanuelsson et al. 1999) and WoLF PSORT (Horton et al. 2007) with default parameters. For the analysis with ChloroP, sequences >4,000 amino acids were excluded due to the program limitation. Orthologous sequences were identified by OrthoFinder (Emms and Kelly 2015) with the 14 protein sequence data set listed in Table 5. Levels of gene expression for each tissue were quantified with the RSEM program (Li and Dewey 2011) with default parameters for both transcript and gene levels. We assumed that gene expression in each tissue was proportional to the number of sequencing reads from the tissue. Identification of candidate genetic markers Repeat sequences were analyzed with RepeatMasker (http://www.repeatmasker.org (March 27, 2018, date last accessed)) and MISA (Thiel et al. 2003) programs. In MISA, microsatellite sequences were defined with the minimum numbers of repeats as follows: eight for mono-SSRs, six for di-SSRs, five for tri-SSRs, four for tetra-SSRs, three for penta-SSRs and three for hexa-SSRs. The location of each microsatellite (coding sequence or not) on a transcript was estimated by the ‘bedtools intersect’ (Schulz et al. 2012). SNVs were also detected utilizing BAM alignment files produced in the expression analysis, which were merged into a single file. The merged BAM file was processed with Picard (http://broadinstitute.github.io/picard (March 27, 2018, date last accessed)) to add RG tags, samtools (Li et al. 2009) to sort, and GATK (McKenna et al. 2010) to realign indel regions, and converted into an mpileup file by samtools. Finally, SNVs were called and filtered with the bcftools package (vcfutils.pl varFilter with parameters ‘-d 3 -Q 20’). Putative SNVs were summarized by the vcftools package. Pinaceae ortholog estimation The utility of TodoFirGene was tested by estimating candidates of Pinaceae orthologs in A. sachalinensis by referencing the Pinaceae composite linkage map (de Miguel et al. 2015). Because the genome organization of Pinaceae is conserved among species, we expect orthologous genes in A. sachalinensis to be identified on the Pinaceae composite map. Orthologous genes were first screened by a reciprocal tBLASTx homology search between TodoFirGene and P. pinaster unigenes (v3.0) (Cañas et al. 2014) downloaded from the SustainPineDB (http://www.scbi.uma.es/sustainpinedb/home_page (March 27, 2018, date last accessed)). Homologous sequences were then identified according to de Miguel et al. (2015). The threshold of the reciprocal best BLAST hit had a percentage identity >85%, an e-value <1E-65 and an alignment length >200 bp. We selected candidate Pinaceae orthologous genes in A. sachalinensis if corresponding P. pinaster genes had homologous genes in Picea species. Database system architecture and usability TodoFirGene was implemented on CentOS (6.8) operating systems with an Apache web server (version 2.2.31) and MySQL database server (version 5.6). PHP (version 5.6) was used as the server-side scripting language. JavaScript and HTML5 are used for data visualization with JBrowse (Buels et al. 2016) and JavaScript libraries: vue (https://vuejs.org (March 27, 2018, date last accessed)), Bootstrap (http://getbootstrap.com/ (March 27, 2018, date last accessed)), D3 (http://d3js.org (March 27, 2018, date last accessed)), jQuery (https://jquery.com (March 27, 2018, date last accessed)) and Chart.js (http://www.chartjs.org/ (March 27, 2018, date last accessed)). Summary annotations were based on Automatic assignment of Human Readable Descriptions (AHRD: https://github.com/groupschoof/AHRD (March 27, 2018, date last accessed)). Users can freely access the database without any registration through internet browsers. We confirmed that the following browsers (operating systems) can be used to access TodoFirGene: FireFox (Linux), Chrome (Windows) and Safari (Macintosh). Supplementary Data Supplementary data are available at PCP online. Funding This work was supported by the Japan Society for the Promotion of Science. [KAKENHI grant Nos. 25292081 and 16H02554 to S.G.]; the Ministry of Education, Culture, Sports, Science, and Technology of Japan (MEXT) [Grant-in-Aid for Scientific Research on Innovative Areas (No. 17H05848 to K.Y.)]; MEXT-Supported Program for the Strategic Research Foundation at Private Universities [2014 2018]; the Institute of Science and Technology, Meiji University [Research Project Grant(A) and Research Project Grant(A)or(B) to K.Y.]; Meiji University [Research Funding for the Computational Software Supporting Program to K.Y.] Acknowledgments The authors thank Akio Kurahashi for initially setting up the plant materials, and Daisuke Sakaue and the staff of the UTHF for help with the sample collection and RNA extraction. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics and the supercomputer of AFFRIT, MAFF, Japan. We also thank the Editors and anonymous reviewers, whose comments greatly improved the database. Disclosures The authors have no conflicts of interest to declare. References Buels R., Yao E., Diesh C.M., Hayes R.D., Munoz-Torres M., Helt G., et al.  . ( 2016) JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol . 17: 66. Google Scholar CrossRef Search ADS PubMed  Cañas R.A., Canales J., Gómez-Maldonado J., Ávila C., Cánovas F.M. ( 2014) Transcriptome analysis in maritime pine using laser capture microdissection and 454 pyrosequencing. Tree Physiol . 34: 1278– 1288. Google Scholar CrossRef Search ADS PubMed  Corbesier L., Vincent C., Jang S., Fornara F., Fan Q., Searle I., et al.  . ( 2007) FT protein movement contributes to long-distance signaling in floral induction of Arabidopsis. Science  316: 1030– 1033. Google Scholar CrossRef Search ADS PubMed  Davidson N.M., Hawkins A.D.K., Oshlack A. ( 2017) SuperTranscripts: a data driven reference for analysis and visualisation of transcriptomes. Genome Biol.  18: 148. Google Scholar CrossRef Search ADS PubMed  de Miguel M., Bartholome J., Ehrenmann F., Murat F., Moriguchi Y., Uchiyama K., et al.  . ( 2015) Evidence of intense chromosomal shuffling during conifer evolution. Genome Biol. Evol.  7: 2799– 2809. Google Scholar PubMed  Eiga S., Sakai A. ( 1984) Altitudinal variation in freezing resistance of Saghalien fir (Abies sachalinensis). Can. J. Bot.  62: 156– 160. Google Scholar CrossRef Search ADS   Eiga S., Sakai A. ( 1987) Regional variation in cold hardiness of Sakhalin fir (Abies sachalinensis Mast.) in Hokkaido, Japan. In Plant Cold Hardiness . Edited by Li P.H. pp. 169– 182. Alan R. Liss, Inc., New York. Emanuelsson O., Nielsen H., von Heijne G. ( 1999) ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci.  8: 978– 984. Google Scholar CrossRef Search ADS PubMed  Emms D.M., Kelly S. ( 2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol.  16: 157. Google Scholar CrossRef Search ADS PubMed  Fujimura T., Shigeta S., Suwa T., Kawamoto S., Aki T., Masubuchi M., et al.  . ( 2005) Molecular cloning of a class IV chitinase allergen from Japanese cedar (Cryptomeria japonica) pollen and competitive inhibition of its immunoglobulin E-binding capacity by latex C-serum. Clin. Exp. Allergy  35: 234– 243. Google Scholar CrossRef Search ADS PubMed  Gilbert D. ( 2013) Gene-omes built from mRNA seq not genome DNA. In 7th Annual Arthropod Genomics Symposium, Notre Dame. Goto S., Kajiya-Kanegae H., Ishizuka W., Kitamura K., Ueno S., Hisamoto Y., et al.  . ( 2017) Genetic mapping of local adaptation along the altitudinal gradient in Abies sachalinensis. Tree Genet. Genomes  13: 104. Google Scholar CrossRef Search ADS   Gouzy J., Carrere S., Schiex T. ( 2009) FrameDP: sensitive peptide detection on noisy matured sequences. Bioinformatics  25: 670– 671. Google Scholar CrossRef Search ADS PubMed  Hisamoto Y., Goto S. ( 2017) Genetic control of altitudinal variation on female reproduction in Abies sachalinensis revealed by a crossing experiment. J. For. Res . 22: 195– 198. Google Scholar CrossRef Search ADS   Horton P., Park K.J., Obayashi T., Fujita N., Harada H., Adams-Collier C.J., et al.  . ( 2007) WoLF PSORT: protein localization predictor. Nucleic Acids Res . 35: W585– W587. Google Scholar CrossRef Search ADS PubMed  Ishizuka W., Goto S. ( 2012) Modeling intraspecific adaptation of Abies sachalinensis to local altitude and responses to global warming, based on a 36-year reciprocal transplant experiment. Evol. Appl . 5: 229– 244. Google Scholar CrossRef Search ADS PubMed  Ishizuka W., Ono K., Hara T., Goto S. ( 2015) Influence of low- and high-elevation plant genomes on the regulation of autumn cold acclimation in Abies sachalinensis. Front. Plant Sci.  6: 890. Google Scholar CrossRef Search ADS PubMed  Kato R. ( 1952) The vegetation of the University of Tokyo Hokkaido Forest (in Japanese with English summary). Bull. Tokyo Univ. For . 43: 1– 18. Kurahashi A., Iguchi K. ( 1993) The relation between age and first cone setting and tree size of Saghalien fir (Abies sachalinensis) planted tree (in Japanese). Trans. Meet. Hokkaido Branch Jpn. For. Soc.  41: 157– 159. Kurahashi A., Ogasawara S., Iguchi K. ( 1993) Variation in the characters of Saghalien fir (Abies sachalinensis) associated with altitudinal gradients—the growth and flower-setting at the age of nineteen years of the planted offspring families of individuals growing at various altitudes (in Japanese). Trans. Jpn. For. Soc.  104: 417– 420. Le Provost, G., Herrera, R., Paiva, J.A., Chaumeil, P., Salin, F. and Plomion, C. ( 2007) A micromethod for high throughput RNA extraction in forest trees. Biol Res. 40: 291– 297. Li B., Dewey C.N. ( 2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics  12: 323. Google Scholar CrossRef Search ADS PubMed  Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al.  . ( 2009) The sequence alignment/map format and SAMtools. Bioinformatics  25: 2078– 2079. Google Scholar CrossRef Search ADS PubMed  Li W., Godzik A. ( 2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics  22: 1658– 1659. Google Scholar CrossRef Search ADS PubMed  Martin M. ( 2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J . 17: 10. Google Scholar CrossRef Search ADS   McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., et al.  . ( 2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res . 20: 1297– 1303. Google Scholar CrossRef Search ADS PubMed  Moriya Y., Itoh M., Okuda S., Yoshizawa A.C., Kanehisa M. ( 2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res . 35: W182– W185. Google Scholar CrossRef Search ADS PubMed  Nystedt B., Street N.R., Wetterbom A., Zuccolo A., Lin Y.-C., Scofield D.G., et al.  . ( 2013) The Norway spruce genome sequence and conifer genome evolution. Nature  497: 579– 584. Google Scholar CrossRef Search ADS PubMed  Parra G., Bradnam K., Korf I. ( 2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics  23: 1061– 1067. Google Scholar CrossRef Search ADS PubMed  Petersen T.N., Brunak S., von Heijne G., Nielsen H. ( 2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods  8: 785– 786. Google Scholar CrossRef Search ADS PubMed  Roschanski A.M., Fady B., Ziegenhagen B., Liepelt S. ( 2013) Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae). Appl. Plant Sci.  1: 1200179. Google Scholar CrossRef Search ADS   Schulz M.H., Zerbino D.R., Vingron M., Birney E. ( 2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics  28: 1086– 1092. Google Scholar CrossRef Search ADS PubMed  Thiel T., Michalek W., Varshney R.K., Graner A. ( 2003) Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor. Appl. Genet.  106: 411– 422. Google Scholar CrossRef Search ADS PubMed  Yamazaki T. ( 1995) Pinaceae. In Pteriodophyta and Gymnospermae . Edited by Iwatsuki K., Yamazaki T., Boufford D.E., Ohba H. pp. 266– 277. Kodansha, Tokyo. Yang Y., Smith S.A. ( 2013) Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC Genomics  14: 328. Google Scholar CrossRef Search ADS PubMed  Zerbe P., Chiang A., Yuen M., Hamberger B., Draper J.A., Britton R., et al.  . ( 2012) Bifunctional cis-abienol synthase from Abies balsamea discovered by transcriptome sequencing and its implications for diterpenoid fragrance production. J. Biol. Chem.  287: 12121– 12131. Google Scholar CrossRef Search ADS PubMed  Zerbino D.R., Birney E. ( 2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res . 18: 821– 829. Google Scholar CrossRef Search ADS PubMed  Abbreviations Abbreviations BLAST Basic Local Alignment Search Tool CEG core eukaryotic gene CEGMA core eukaryotic genes mapping approach FPKM fragments per kilobase of transcript per million mapped reads FT Flowering Locus T GO Gene Ontology KAAS KEGG Automatic Annotation Server KEGG Kyoto Encyclopedia of Genes and Genomes RAD restriction-site associated DNA RNA-Seq RNA sequencing SNV single nucleotide variant SSR simple sequence repeat TSA transcriptome shotgun assembly UTHF University of Tokyo Hokkaido Forest Footnotes Footnotes The nucleotide sequence reported in this paper has been submitted to the DDBJ DRA with accession numbers DRA004908–DRA004911 © The Author(s) 2018. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Plant and Cell Physiology Oxford University Press

TodoFirGene: Developing Transcriptome Resources for Genetic Analysis of Abies sachalinensis

Loading next page...
 
/lp/ou_press/todofirgene-developing-transcriptome-resources-for-genetic-analysis-of-11Ip5UkODt
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com
ISSN
0032-0781
eISSN
1471-9053
D.O.I.
10.1093/pcp/pcy058
Publisher site
See Article on Publisher Site

Abstract

Abstract Todo-matsu (Abies sachalinensis) is one of the most important forestry species in Hokkaido, Japan and is distributed from near sea level to the alpine zone. Due to its wide spatial distribution, the species adapts to its environment, displaying phenotypes of ecological relevance. In order to identify candidate genes under natural selection, we collected the transcriptome from the female and male flower, leaf and inner bark. De novo assembly with 34.7 Gb of sequencing reads produced 158,542 transcripts from 69,618 loci, whose estimated coverage reached 95.6% of conserved eukaryotic genes. Homology searches against publicly available databases identified 134,190 (84.6%) transcripts with at least one hit. In total, 28,944 simple sequence repeats (SSRs) and 80,758 single nucleotide variants (SNVs) were detected from 23,570 (14.9%) and 25,366 (16.0%) transcripts, which were valuable for use in genetic analysis of the species. All the annotations were included in a relational database, TodoFirGene, which provides an interface for various queries and homology search, and can be accessed at http://plantomics.mind.meiji.ac.jp/todomatsu/. This database hosts not only the A. sachalinensis transcriptome but also links to the proteomes of 13 other species, allowing a comparative genomic study of plant species. Introduction Abies sachalinensis (known as todo-matsu, todo-fir or Sakhalin fir) is a coniferous, wind-pollinated and wind-dispersed species in the Pinaceae family (2n = 24) (Yamazaki 1995). This conifer plays an important role in timber use and forms an important component of forest ecosystems as a dominant climax species of the sub-boreal zone in northern Japan (Kato 1952). The geographic range of this species mainly extends from Hokkaido, the northern island of Japan, to Sakhalin Island. The elevational distribution extends from near sea level to the alpine zone, approximately 1,600 m a.s.l. (Yamazaki 1995). Due to this wide ecological distribution, large phenotypic variations along the environmental gradient were well known, suggesting adaptation to the habitat (Eiga and Sakai 1984, Eiga and Sakai 1987). A reciprocal transplant experiment along the elevational gradient showed that both upslope and downslope transplantation reduced productivity, indicating the home-site advantage of the local population (Ishizuka and Goto 2012). Despite being planted under favorable environmental conditions on a downslope site, plants derived from the upslope population showed inferior growth to those from other sites. Intraspecific adaptation to the local climate is thought to be driven by genome-based control of the responsible traits, such as autumn phenology (Ishizuka et al. 2015). However, the presence of relevant genes and/or their mutations and their potential for local adaptation have not been sufficiently examined for this species. In order to evaluate the genes responsible and their ecological relevance, gene database/catalogs must be an appropriate tool for primary screening for candidate genes, which is, however, lacking for this species and its genus Abies. Transcriptome sequencing or RNA-seq (RNA sequencing) is a method to capture whole expressed genes in a specific tissue. The resulting vast amount of data can be utilized for the analysis of digital gene expression as well as for the construction of transcriptome contigs or transcripts in non-model organisms for which no reference sequences are available. Development of a reference gene set is a cost-effective strategy of first choice for non-model organisms, enabling a detailed downstream genetic study by providing an atlas of the expressed genes. Although sequence information from related species such as A. alba (Roschanski et al. 2013) and A. balsamea (Zerbe et al. 2012) could be transferred, detailed genetic analysis at species level needs reference sequences from the target species. In the present study, our objectives were to develop transcriptome resources for A. sachalinensis, for which no genome resources are available to date. The resulting transcript sequences were annotated to construct a database, TodoFirGene, which will be used to identify candidate genes of ecological relevance in Abies species in future studies. Furthermore, orthologous sequences in 14 species including A. sachalinensis were identified for comparative studies, which reinforce the omics information of gymnosperms and connect researchers from forest sciences with those in comparative bioinformatics and evolutionary sciences. Results and Discussion Construction of the reference transcriptome In total we collected 36.5 Gb of sequencing reads from four tissues, ranging from 7.4 Gb from the inner bark to 10.9 Gb from the female flower (Table 1). After pre-processing, 34.7 Gb (95%) of reads were retained and used for assembly. The assembler, Oases (Schulz et al. 2012), with k-mers from 21 to 95 produced 2,550,215 contigs in total, ranging from 28,383 (k = 95) to 357,191 (k = 21) (Supplementary Table S1), which EvidentialGene (Gilbert 2013) clustered into 158,542 transcripts with an estimated 69,618 loci. Each locus had a mean and maximum of 2.28 and 123 transcripts, respectively. The length of transcripts varied from 180 to 19,482 bp, with an average of 1,560 bp (Fig. 1), which we defined as the ‘Public set’ of the reference transcriptome for A. sachalinensis (TodoFirGene) (Table 2). Table 1 Transcriptome sequencing of A. sachalinensis by HiSeq2000 Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Table 1 Transcriptome sequencing of A. sachalinensis by HiSeq2000 Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Sample ID  Tissue  No. of read pairs  Total base output (Gbp)  No. of read pairs after cleaning  Total cleaned base (Gbp)  Im1996  Male flower  42,348,328  8.55  41,601,106  8.32  Im1997  Female flower  54,183,830  10.95  52,807,684  10.56  Im1998  Leaf  47,580,254  9.61  45,458,689  9.10  Im1999  Inner bark  36,607,117  7.39  33,437,314  6.68  Total    180,719,529  36.50  173,304,793  34.66  Table 2 Assembly statistics of the A. sachalinensis transcriptome Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Table 2 Assembly statistics of the A. sachalinensis transcriptome Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Number of contigs  158,542  Number of nucleotides (bp)  247,403,210  Number of estimated loci  69,618  N50 (bp)  2,239  Average length (bp)  1,560  No. of coding sequences (CDS)  158,542  Average nucleotide length of CDS (bp)  1,150  Fig. 1 View largeDownload slide Histogram of transcript length for A. sachalinensis. Fig. 1 View largeDownload slide Histogram of transcript length for A. sachalinensis. Evaluation of the reference transcriptome When the comprehensiveness of TodoFirGene was evaluated by CEGMA (core eukaryotic gene mapping approach) (Parra et al. 2007), 237 (95.6%) of 248 core eukaryotic genes (CEGs) were found. As for chimera detection, the method of Yang and Smith (YS method) with optimize_assembler (Yang and Smith 2013) identified 1,212 (0.76%) transcripts as chimeric, while FrameDP (Gouzy et al. 2009) identified 152,715 protein sequences from 158,542 transcripts (TodoFirGene), resulting in 14,418 (9.1%) of the reference with multiple protein sequences estimated. In all, 748 (0.47%) transcripts were identified by both YS and FrameDP methods, and ‘High’ levels of chimeric risk were suggested, while 14,134 (8.9%) transcripts were possible chimeras identified by either of the two methods and labeled as ‘Middle’ chimeric risk. In our preliminary assembly and clustering of the transcriptome of A. sachalinensis, we used Oases with k-mers 31, 41, 51, 61 and 71. After splitting contig sequences at ‘N’ sites, sequences were then merged with the Oases ‘-merge’ option and clustered with CD-HIT-EST (Li and Godzik 2006), which produced 223,621 transcripts with an estimated 71,707 loci. This portion of the transcriptome covered as much as 99.6% of 248 CEGs; however, 6,631 (4.5%) transcripts were estimated as possible chimeras by the YS method, showing that the rate of chimeric sequences (4.5%) was much higher than that (0.90%) of the ‘Public set’ (Table 3). The ‘-merge’ option by Oases may contribute to a rise in the rate of chimeric transcripts (Yang and Smith 2013). In contrast, when we selected ‘OK main’ only and excluded ‘alt’ categories from the output of EvidentialGene, the resulting set contained 53,798 transcripts from 33,406 loci. Unfortunately, this smallest and compact set of the transcriptome covers 91.9% of 248 CEGs, and 543 (1.35%) transcripts are identified as possible chimeras by the YS method. We therefore considered the ‘Public set’ suitable for defining the reference transcriptome of A. sachalinensis (TodoFirGene). Table 3 Evaluation of transcriptome assembly for A. sachalinensis, with three sets of transcriptome assembly considered Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Table 3 Evaluation of transcriptome assembly for A. sachalinensis, with three sets of transcriptome assembly considered Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Set  No. of transcripts  Average length (bp)  Comprehensiveness (%)  Chimeric sequence (%)  Global  223,621  1,325  99.6  4.50  Essential  53,798  1,324  91.9  1.35  Public  158,542  1,560  95.6  0.90  Annotation of TodoFirGene Blast searches for the TodoFirGene transcriptome identified a minimum of 98,605 (62.2%) and a maximum of 124,381 (78.4%) hits against UniProt (Swiss-Prot) and Picea abies protein database sequences, respectively (Table 4), totaling 136,290 (86.0%) transcripts with at least one hit. When taxonomy information in the NR (non-redundant) database entry was considered, we found that 3,619 (2.3%) transcripts may come from organisms other than A. sachalinensis (Supplementary Table S2), which may co-occur with the species. InterProScan added 1,768,327 annotations for 111,395 (70.3%) putative proteome estimated by EvidentialGene (Table 4). In total, 1,852 Gene Ontology (GO) and 85 GOslim terms were extracted, which could be distributed among 83,969 (53.0%) sequences. According to the KEGG Automatic Annotation Server (KAAS), 29,751 (18.8%) of transcripts were mapped to 3,314 Kyoto Encyclopedia of Genes and Genomes (KEGG) entries, with disease resistance protein RPS2 (K13459) the most frequent, of which 1,507 (0.95%) of transcripts were included. The average number of sequences within a single KEGG ortholog group was 23.1. Analysis of intracellular localization with the SignalP, ChloroP and WoLF PSORT programs identified, respectively, 13,041 (8.2%), 20,383 (12.9%) and 158,500 (100.0%) putative reference proteins with intracellular localization signals. Table 4 Annotation statistics of TodoFirGene Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  Table 4 Annotation statistics of TodoFirGene Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  Database/character  No. of transcripts with hits/relation  Percentage  NR  120,758  76.2%  UniProt  98,605  62.2%  Picea abies protein  124,381  78.4%  Populus trichocarpa protein  106,670  67.3%  Abies alba nucleotide  114,038  71.9%  KEGG  25,248  15.9%  InterPro  111,395  70.3%      Coils  19,026  12.0%      Gene3D  80,632  50.9%      Hamap  1,849  1.2%      PIRSF  4,097  2.6%      PRINTS  20,968  13.2%      Pfam  99,768  62.9%      ProDom  834  0.5%      ProSitePatterns  24,181  15.3%      ProSiteProfiles  42,074  26.5%      SMART  32,295  20.4%      SUPERFAMILY  84,869  53.5%      TIGRFAM  10,845  6.8%      Gene ontology  80,406  50.7%  ChloroP  13,041  8.2%  SignalP  20,383  12.9%  WoLF PSORT  158,500  100.0%  RepeatMasker  24,050  15.2%  MISA (Microsatellites)  23,570  14.9%  SNV/MNV  94,755  59.8%  In ortholog analysis with a 14-plant proteome set (Table 5), 212,540 orthogroups were identified, 59,553 (28.0%) of which contained at least one A. sachalinensis sequence, while none of the sequences was related to the remaining 152,987 (72.0%) orthogroups (Fig. 2). When we focused on orthogroups with A. sachalinensis sequences, those with only A. sachalinensis (unique orthogroup) were the most common [46,280 (21.8%) of the total orthogroup], followed by orthogroups with all 14 species [5,592 (2.6%) common orthogroups]. This may result from a unique character of A. sachalinensis and a universal feature of the plant transcriptome. The unique orthogroup contained 47,949 transcripts from TodoFirGene, 32,899 (68.6%) of which had no hits against the NR database, while the common orthogroup contained 53,205 transcripts, only 581 (1.1%) of which had no hits against the NR database. Table 5 Data sources used for ortholog detection by orthofinder Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Table 5 Data sources used for ortholog detection by orthofinder Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Species  Gene models and version  Source  Abies sachalinensis  –  TodoFirGene  Arabidopsis thaliana  TAIR10  TAIR  Glycine max  Wm82.a2.v1  Phytozome  Medicago truncatula  Mt4.0v1  Medicago Truncatula Genome Project v4.0  Nicotiana tabacum  TN90_AYMY-SS  SGN  Oryza sativa  RAP-DB 2015-03-31  RAP-DB  Physcomitrella patens  v3.1  Phytozome  Picea abies  v1.0  Congenie  Populus trichocarpa  v3.1  Phytozome  Solanum lycopersicum  ITAG2.4  SGN  Solanum tuberosum  V3.4  PGSC  Sorghum bicolor  v2.1  Phytozome  Vitis vinifera  V2.1  Grape genome database  Zea mays  AGPv3.23  Phytozome  Fig. 2 View largeDownload slide Ortholog analysis with 14 proteome sets. The x-axis indicates the number of species in an orthogroup, and the y-axis displays the number of genes (transcripts) included in the corresponding cluster category. The abbreviations are as follows: As, Abies sachalinensis; At, Arabidopsis thaliana; Gm, Glycine max; Mt, Medicago truncatula; Nt, Nicotiana tabacum; Os, Oryza sativa; Pa, Picea abies; Pp, Physcomitrella patens; Pt, Populus trichocarpa; Sb, Sorghum bicolor; Sl, Solanum lycopercicum; St, Solanum tuberosum; and Vv, Vitis vinifera. Fig. 2 View largeDownload slide Ortholog analysis with 14 proteome sets. The x-axis indicates the number of species in an orthogroup, and the y-axis displays the number of genes (transcripts) included in the corresponding cluster category. The abbreviations are as follows: As, Abies sachalinensis; At, Arabidopsis thaliana; Gm, Glycine max; Mt, Medicago truncatula; Nt, Nicotiana tabacum; Os, Oryza sativa; Pa, Picea abies; Pp, Physcomitrella patens; Pt, Populus trichocarpa; Sb, Sorghum bicolor; Sl, Solanum lycopercicum; St, Solanum tuberosum; and Vv, Vitis vinifera. Digital expression analysis with RSEM (RNA-Seq by expectation-maximization) identified candidate transcripts that are expressed specifically in each tissue (Fig. 3). Inner bark was the most characteristic tissue in terms of gene expression, because 10,373 transcripts or 4,793 genes were uniquely expressed in this tissue. In contrast, leaf tissue contained only 3,773 and 770 leaf-specific transcripts and genes, respectively. Fig. 3 View largeDownload slide Venn diagram for read distributions for each tissue which were mapped onto TodoFirGene. The figures beneath the tissue name represent the number of transcripts that have FPKM (fragments per kilobase of transcript per million mapped reads) >0. Levels of expression were quantified for each transcript (a) and gene (b). Fig. 3 View largeDownload slide Venn diagram for read distributions for each tissue which were mapped onto TodoFirGene. The figures beneath the tissue name represent the number of transcripts that have FPKM (fragments per kilobase of transcript per million mapped reads) >0. Levels of expression were quantified for each transcript (a) and gene (b). As for repeat sequences, RepeatMasker masked 1,452,758 bp, which amounted to 0.59% of the total transcript sequences (Supplementary Table S3) and was distributed among 24,050 (15.2%) transcripts. Among the repeat sequences, simple sequence repeats (SSRs) were the most frequent, and RepeatMasker identified simple sequences or low-complexity sequences in 23,467 (14.8%) transcripts. The MISA program also discovered 28,944 SSRs in 23,570 (14.9%) of transcripts. The most frequent repeats were mono-SSRs, followed by di-SSRs (Fig. 4). In the coding region, the frequency of tri- and hexanucleotide repeats was high, where gain or loss of other nucleotide repeats caused frameshift mutations, resulting in protein malfunction. When we exclude mononucleotide repeats, 17,008 repeats were found in 13,780 (8.7%) transcripts. The most frequent di-SSR and tri-SSR motifs were AT and AGG, respectively (Supplementary Fig. S1). Fig. 4 View largeDownload slide Frequency distribution for each microsatellite motif. Fig. 4 View largeDownload slide Frequency distribution for each microsatellite motif. Candidate single nucleotide variants (SNVs) were detected from 25,366 (16.0%) transcripts totaling 80,758 SNVs (3.2 SNVs/transcript on average). The ratio of transition to transversion mutations was 1.51 (Supplementary Table S4). Transcriptome database: TodoFirGene We have constructed a transcriptome database, TodoFirGene (Fig. 5), where most of the results of our analysis were loaded and are accessible through the Internet under the following URL: http://plantomics.mind.meiji.ac.jp/todomatsu/. The database can be queried by either keywords, sequences or expression levels from the front page (Fig. 5). On the query results page, links to detailed information for transcripts/loci are presented. The transcript detail page (Fig. 6) displays the transcript and annotation summary, followed by each annotation detail for KEGG orthology and pathway, InterProScan (including GO) and BLAST search results against NR and UniProt (Swiss-Prot) databases, protein sequences from spruce (P. abies) and poplar (Populus trichocarpa), and transcriptome sequences from silver fir (A. alba) databases. Further annotations were added for intracellular localization and orthologous groups of predicted proteins. For each sequence in orthogroups, the following links to the PODC (Plant Omics Data Center) will display the levels of gene expression in a specified species. The detailed expression for four tissues in A. sachalinensis is shown as a bar graph on the same transcript detail page as well as in JBrowse display, which includes unique read mapping (multiply mapped reads are not shown), and annotations such as protein translations, BLAST results, repeat sequences (microsatellites) and SNVs. Other information includes the chimera risk of the transcript, the nucleotide sequence and predicted protein sequence. For loci with multiple transcripts, the locus detail page (Fig. 7) is convenient, where superTranscript is constructed to which multiple transcripts are visually aligned. Locus and transcript level expression are also shown in a bar graph. The TodoFirGene website has a bulk download page (http://plantomics.mind.meiji.ac.jp/todomatsu/download.html), where transcript and coding sequences and predicted protein sequences are available. Pairwise comparisons of gene expression among four tissues (male flower, female flower, young flushing leaf and inner bark) in terms of FPKM (fragments per kilobase of transcript per million mapped reads) are shown in the Gene expression atlas page (http://plantomics.mind.meiji.ac.jp/todomatsu/exp_atlas.html), where the values of FPKM for transcript and locus level can be downloaded (Supplementary Fig. S2). Fig. 5 View largeDownload slide TodoFirGene database screen shot displaying the query page and response for Gene (key words) search, BLAST and Expression value search. Fig. 5 View largeDownload slide TodoFirGene database screen shot displaying the query page and response for Gene (key words) search, BLAST and Expression value search. Fig. 6 View largeDownload slide Browsing the Transcript detail page of TodoFirGene with the navigation on the left and the body on the right. (A) Summary for a transcript and annotation. (B) KEGG orthology and pathway. (C) InterProScan including protein families, domains and other signatures. (D) Gene Ontology and GO slim terms for plants. (E) BLAST results against NR, UniProt (Swiss-Prot), spruce and poplar protein databases. (F) Intracellular localization predicted by ChloroP, SignalP and WoLF PSORT. (G) Orthologous groups predicted by orthofinder. (H) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units. (I) Assembly details including chimera risks and repeat sequences predicted by RepeatMasker. (J) Transcript and protein sequences. Fig. 6 View largeDownload slide Browsing the Transcript detail page of TodoFirGene with the navigation on the left and the body on the right. (A) Summary for a transcript and annotation. (B) KEGG orthology and pathway. (C) InterProScan including protein families, domains and other signatures. (D) Gene Ontology and GO slim terms for plants. (E) BLAST results against NR, UniProt (Swiss-Prot), spruce and poplar protein databases. (F) Intracellular localization predicted by ChloroP, SignalP and WoLF PSORT. (G) Orthologous groups predicted by orthofinder. (H) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units. (I) Assembly details including chimera risks and repeat sequences predicted by RepeatMasker. (J) Transcript and protein sequences. Fig. 7 View largeDownload slide Browsing the Locus detail page of TodoFirGene with navigation on the left and the body on the right. (A) Summary of transcripts at the same locus. (B) Isoforms aligned with superTranscript constructed by Lace. (C) Summary of annotation with InterPro, Gene Ontology and BLAST for each transcript. (D) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units for a locus and for all isoforms. Fig. 7 View largeDownload slide Browsing the Locus detail page of TodoFirGene with navigation on the left and the body on the right. (A) Summary of transcripts at the same locus. (B) Isoforms aligned with superTranscript constructed by Lace. (C) Summary of annotation with InterPro, Gene Ontology and BLAST for each transcript. (D) Expression in FPKM (fragments per kilobase of transcript per million mapped reads) units for a locus and for all isoforms. Utility of TodoFirGene Exploring candidate genes for flowering Predicting flowering is important for industrial seed production, not only of A. sachalinensis but also of other tree species which flower sporadically. Variable age of maturity with flowering is observed for A. sachalinensis depending on environmental conditions and the origin of individuals (Kurahashi and Iguchi 1993). Abies sachalinensis from higher altitudes reach maturity and flower earlier than those from lower altitudes (Kurahashi et al. 1993). Artificial pollination experiments between high- and low-altitude genotypes revealed that these reproductive traits must be genetically controlled (Hisamoto and Goto 2017). Flowering is induced by plant hormones, one of which is florigen encoded by Flowering Locus T (FT). In Arabidopsis thaliana, FT is expressed and translated in leaf tissue and transported to the apical meristem, where floral identity genes are activated (Corbesier et al. 2007). To find FT genes in A. sachalinensis, go to Gene Search and enter the keyword ‘Flowering Locus T’. This query returns four entries (four transcripts from three loci), the sequence and annotations of which are available from the download button (Fig. 5). From the Ortholog group column, one locus (AbisacEGm027477) may be specific to A. sachalinensis, while two loci (AbisacEGm026355 and AbisacEGm027876) have common orthologs in all 14 species. A BLAST search is also used to find FT, using, for example, A. thaliana FT available from: https://www.ncbi.nlm.nih.gov/nuccore/AB027504.1? report=fasta (March 27, 2018, date last accessed). Pasting the FASTA-formatted sequence to the ‘Query fasta’ box, selecting the appropriate BLAST program and parameters (blastx and e-value of 1E-50 in the current case) and clicking the ‘Submit’ button will execute BLAST searches, and results are displayed (Fig. 5) with four transcripts from four loci. Pollen is a specific structure in the male flower, and we expect characteristic transcripts from the tissue. To find transcripts with expression values >100 times greater in the male flower compared with the female flower, go to Expression value search and enter, for example, 100 in the fold change box. The result shows 20 genes with the greatest expression difference of 1,715.18 from AbisacEGm065621 (Fig. 5). Clicking the link to the locus detail page for AbisacEGm065621 will show annotations suggesting that the gene may function as chitinase and have nine transcripts. Chitinase has been identified as a pollen allergen in Cryptomeria japonica (Cupressaceae) (Fujimura et al. 2005), a major forestry conifer in Japan. Detecting Pinaceae orthologous genes Although OrthoFinder identified that 5,592 orthogroups are common in 14 species, we focus on orthologs among conifers, especially Pinaceae, where conservation of synteny is widely observed among species in different genera. TodoFirGene reference transcripts were utilized to search orthologs of Pinaceae genes (de Miguel et al. 2015) and we found 2,227 homologous sequences in Pinus pinaster, which has been on the Pinaceae composite map. Among these, 955 had homologous sequences from Picea species, which we assumed to be candidate orthologous genes in A. sachalinensis among Pinaceae (Supplementary Table S5). From congeneric sequences in Abies, 387 A. alba genes were identified as A. sachalinensis orthologous and also as Pinaceae orthologous candidates (Supplementary Table S5). The smaller number of candidate orthologous genes identified in A. alba may come from incomplete coverage of the transcriptome [22,561 transcriptome shotgun assembly (TSA) sequences with an average length of 538 bp]. Orthologous genes will be further confirmed after we carry out linkage mapping of these candidates. de Miguel et al. (2015) identified 513 orthologous unigenes between Pinus and Picea with BLAST homology search and linkage map information. For A. sachalinensis, we have already constructed linkage maps using RAD-Seq (restriction site-associated DNA sequencing)-based markers, and identified several quantitative trait loci (QTLs) (Goto et al. 2017). Genetic markers resulting from RAD-Seq normally have no relationships to gene functions. Also sequence similarity of RAD-Seq between species is expected to be low, suggesting low transferability of these markers to other related species. In contrast, the conservative nature of genic sequences helps to link syntenic regions between related species, especially within Pinaceae, whose level of synteny is relatively high. Linkage mapping with these candidate orthologous genes is now underway. Materials and Methods Plant materials and RNA-Seq A matured Abies sachalinensis individual was selected for the present study. The origin of this tree was the high-elevation zone of Mt. Dairoku (1459 m a.s.l.; 43°20'23''N, 142°37'56''E), while the growing environment was in the lowland. In 1973, open pollinated seeds were collected from a natural stand at 1,250 m, and were sown in a nursery seed bed of the University of Tokyo Hokkaido Forest (UTHF) the following year. In 1978, the seedlings were planted in the exhibition forest at 230 m in the UTHF (43°13'11''N, 142°23'03''E). When planted trees were 37 years old, a healthy and flowering individual was selected to use for the subsequent RNA experiment. RNA was separately extracted from (i) fresh male flower; (i) fresh female flower, (iii) young flushing leaf; and (iv) inner bark tissue, which were collected on a partly cloudy morning (10:00–11:00 h) on May 24, 2010. Approximately 0.5–1.0 g of each fresh tissue was used for RNA extraction. Total RNA of each tissue was extracted following the method described by Le Provost et al. (2007) and stored at –80°C until sequencing analysis. The RNA-Seq experiment was performed by Hokkaido System Science Inc. with 100 bp paired-end libraries run on a HiSeq2000 (Illumina Inc.). We used a separate barcode to distinguish each tissue sample on a single flow cell lane. Bioinformatic analysis Pre-processing of sequencing reads Redundant sequencing reads (reads with the same sequence in a pair) were removed with an in-house script, and then adaptors and poly(A) sequences were removed with cutadapt (Martin 2011). Low-quality sequences were also removed with the in-house program, where reads of length ≥20 bp were retained. Reads with similarities to rRNA/tRNA sequences in A. thaliana and Oryza sativa (downloaded from TAIR and RAP-DP, respectively) were removed according to the BLASTn results (e-value cut-off of 1E-5). If either forward or reverse reads were removed during this pre-process, they were both removed. Remaining reads were re-paired in paired-end form. Assembly, clustering and evaluation of reference transcriptome Pre-processed reads were assembled with Velvet (Zerbino and Birney 2008) and Oases (Schulz et al. 2012) with k-mers (21–95), and all of the resulting transcripts were clustered with the EvidentialGene pipeline (Gilbert 2013). We selected ‘OK main’ and ‘alt’ categories from the output of EvidentialGene and defined the ‘Public set’ of the reference transcriptome for A. sachalinensis (TodoFirGene). Multiple transcripts from the same gene were integrated into a superTranscript with Lace software (Davidson et al. 2017). In order to evaluate the reference transcriptome, we focused on comprehensiveness and the number of erroneous chimeras in the reference. Comprehensiveness was assessed with CEGMA (Parra et al. 2007), in which 248 CEGs were BLASTed against TodoFirGene. Chimeric candidates were detected by FrameDP (Gouzy et al. 2009) with Pabies1.0-all-pep.fa as a reference protein sequence, which was merged from Pabies1.01.0-HC-pep.faa, Pabies1.01.0-MC-pep.faa and Pabies1.01.0-LC-pep.faa on the web site of the Norway Spruce Genome Project (Nystedt et al. 2013). In addition, chimeric transcripts were also detected by a script in optimize_assembler (Yang and Smith 2013) (hereafter referred to as the YS method) with Pabies1.0-all-pep.fa as a reference transcript and an e-value cut-off of 1E-2. If FrameDP identified multiple amino acid sequences within a single transcript, and/or the YS method identified a transcript as chimeric, we assumed the transcripts to have a certain level of risk of chimera. We defined three risk levels of ‘High’, ‘Middle’ and ‘Low’, if both, one and no methods identified a transcript as chimeric, respectively. Annotation of reference transcriptome Reference transcriptome sequences (TodoFirGene) were BLASTed against NR and UniProt (Swiss-Prot). Protein sequences from P. abies (ver. 1.0) and P. trichocarpa (ver. 3.1) were also queried, which were downloaded from ConGenIE (http://congenie.org/start (March 27, 2018, date last accessed)) and phytozome12 (https://phytozome.jgi.doe.gov/pz/portal.html (March 27, 2018, date last accessed)), respectively. As congeneric sequences, A. alba 22,561 TSA sequences (Roschanski et al. 2013) were downloaded and queried by tBLASTx. We applied an e-value cut-off of 1E-5 for all BLAST searches. From the result of the NR BLAST search, taxonomy information on the top hit accession number from each query was extracted with the tax2acc tool (https://github.com/richardmleggett/acc2tax (March 27, 2018, date last accessed)). We used the KEGG database to estimate metabolic pathways in KAAS (Moriya et al. 2007) with eudicot and monocot gene sets. Estimated amino acid sequences by EvidentialGene were queried against the InterPro database, whose output was used to extract GO terms. GO terms were converted into plant GOslim terms by Map2Slim from OWLTools (https://github.com/owlcollab/owltools (March 27, 2018, date last accessed)). Intracellular localization for the amino acid sequences by EvidentialGene was predicted with SignalP (Petersen et al. 2011), ChloroP (Emanuelsson et al. 1999) and WoLF PSORT (Horton et al. 2007) with default parameters. For the analysis with ChloroP, sequences >4,000 amino acids were excluded due to the program limitation. Orthologous sequences were identified by OrthoFinder (Emms and Kelly 2015) with the 14 protein sequence data set listed in Table 5. Levels of gene expression for each tissue were quantified with the RSEM program (Li and Dewey 2011) with default parameters for both transcript and gene levels. We assumed that gene expression in each tissue was proportional to the number of sequencing reads from the tissue. Identification of candidate genetic markers Repeat sequences were analyzed with RepeatMasker (http://www.repeatmasker.org (March 27, 2018, date last accessed)) and MISA (Thiel et al. 2003) programs. In MISA, microsatellite sequences were defined with the minimum numbers of repeats as follows: eight for mono-SSRs, six for di-SSRs, five for tri-SSRs, four for tetra-SSRs, three for penta-SSRs and three for hexa-SSRs. The location of each microsatellite (coding sequence or not) on a transcript was estimated by the ‘bedtools intersect’ (Schulz et al. 2012). SNVs were also detected utilizing BAM alignment files produced in the expression analysis, which were merged into a single file. The merged BAM file was processed with Picard (http://broadinstitute.github.io/picard (March 27, 2018, date last accessed)) to add RG tags, samtools (Li et al. 2009) to sort, and GATK (McKenna et al. 2010) to realign indel regions, and converted into an mpileup file by samtools. Finally, SNVs were called and filtered with the bcftools package (vcfutils.pl varFilter with parameters ‘-d 3 -Q 20’). Putative SNVs were summarized by the vcftools package. Pinaceae ortholog estimation The utility of TodoFirGene was tested by estimating candidates of Pinaceae orthologs in A. sachalinensis by referencing the Pinaceae composite linkage map (de Miguel et al. 2015). Because the genome organization of Pinaceae is conserved among species, we expect orthologous genes in A. sachalinensis to be identified on the Pinaceae composite map. Orthologous genes were first screened by a reciprocal tBLASTx homology search between TodoFirGene and P. pinaster unigenes (v3.0) (Cañas et al. 2014) downloaded from the SustainPineDB (http://www.scbi.uma.es/sustainpinedb/home_page (March 27, 2018, date last accessed)). Homologous sequences were then identified according to de Miguel et al. (2015). The threshold of the reciprocal best BLAST hit had a percentage identity >85%, an e-value <1E-65 and an alignment length >200 bp. We selected candidate Pinaceae orthologous genes in A. sachalinensis if corresponding P. pinaster genes had homologous genes in Picea species. Database system architecture and usability TodoFirGene was implemented on CentOS (6.8) operating systems with an Apache web server (version 2.2.31) and MySQL database server (version 5.6). PHP (version 5.6) was used as the server-side scripting language. JavaScript and HTML5 are used for data visualization with JBrowse (Buels et al. 2016) and JavaScript libraries: vue (https://vuejs.org (March 27, 2018, date last accessed)), Bootstrap (http://getbootstrap.com/ (March 27, 2018, date last accessed)), D3 (http://d3js.org (March 27, 2018, date last accessed)), jQuery (https://jquery.com (March 27, 2018, date last accessed)) and Chart.js (http://www.chartjs.org/ (March 27, 2018, date last accessed)). Summary annotations were based on Automatic assignment of Human Readable Descriptions (AHRD: https://github.com/groupschoof/AHRD (March 27, 2018, date last accessed)). Users can freely access the database without any registration through internet browsers. We confirmed that the following browsers (operating systems) can be used to access TodoFirGene: FireFox (Linux), Chrome (Windows) and Safari (Macintosh). Supplementary Data Supplementary data are available at PCP online. Funding This work was supported by the Japan Society for the Promotion of Science. [KAKENHI grant Nos. 25292081 and 16H02554 to S.G.]; the Ministry of Education, Culture, Sports, Science, and Technology of Japan (MEXT) [Grant-in-Aid for Scientific Research on Innovative Areas (No. 17H05848 to K.Y.)]; MEXT-Supported Program for the Strategic Research Foundation at Private Universities [2014 2018]; the Institute of Science and Technology, Meiji University [Research Project Grant(A) and Research Project Grant(A)or(B) to K.Y.]; Meiji University [Research Funding for the Computational Software Supporting Program to K.Y.] Acknowledgments The authors thank Akio Kurahashi for initially setting up the plant materials, and Daisuke Sakaue and the staff of the UTHF for help with the sample collection and RNA extraction. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics and the supercomputer of AFFRIT, MAFF, Japan. We also thank the Editors and anonymous reviewers, whose comments greatly improved the database. Disclosures The authors have no conflicts of interest to declare. References Buels R., Yao E., Diesh C.M., Hayes R.D., Munoz-Torres M., Helt G., et al.  . ( 2016) JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol . 17: 66. Google Scholar CrossRef Search ADS PubMed  Cañas R.A., Canales J., Gómez-Maldonado J., Ávila C., Cánovas F.M. ( 2014) Transcriptome analysis in maritime pine using laser capture microdissection and 454 pyrosequencing. Tree Physiol . 34: 1278– 1288. Google Scholar CrossRef Search ADS PubMed  Corbesier L., Vincent C., Jang S., Fornara F., Fan Q., Searle I., et al.  . ( 2007) FT protein movement contributes to long-distance signaling in floral induction of Arabidopsis. Science  316: 1030– 1033. Google Scholar CrossRef Search ADS PubMed  Davidson N.M., Hawkins A.D.K., Oshlack A. ( 2017) SuperTranscripts: a data driven reference for analysis and visualisation of transcriptomes. Genome Biol.  18: 148. Google Scholar CrossRef Search ADS PubMed  de Miguel M., Bartholome J., Ehrenmann F., Murat F., Moriguchi Y., Uchiyama K., et al.  . ( 2015) Evidence of intense chromosomal shuffling during conifer evolution. Genome Biol. Evol.  7: 2799– 2809. Google Scholar PubMed  Eiga S., Sakai A. ( 1984) Altitudinal variation in freezing resistance of Saghalien fir (Abies sachalinensis). Can. J. Bot.  62: 156– 160. Google Scholar CrossRef Search ADS   Eiga S., Sakai A. ( 1987) Regional variation in cold hardiness of Sakhalin fir (Abies sachalinensis Mast.) in Hokkaido, Japan. In Plant Cold Hardiness . Edited by Li P.H. pp. 169– 182. Alan R. Liss, Inc., New York. Emanuelsson O., Nielsen H., von Heijne G. ( 1999) ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci.  8: 978– 984. Google Scholar CrossRef Search ADS PubMed  Emms D.M., Kelly S. ( 2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol.  16: 157. Google Scholar CrossRef Search ADS PubMed  Fujimura T., Shigeta S., Suwa T., Kawamoto S., Aki T., Masubuchi M., et al.  . ( 2005) Molecular cloning of a class IV chitinase allergen from Japanese cedar (Cryptomeria japonica) pollen and competitive inhibition of its immunoglobulin E-binding capacity by latex C-serum. Clin. Exp. Allergy  35: 234– 243. Google Scholar CrossRef Search ADS PubMed  Gilbert D. ( 2013) Gene-omes built from mRNA seq not genome DNA. In 7th Annual Arthropod Genomics Symposium, Notre Dame. Goto S., Kajiya-Kanegae H., Ishizuka W., Kitamura K., Ueno S., Hisamoto Y., et al.  . ( 2017) Genetic mapping of local adaptation along the altitudinal gradient in Abies sachalinensis. Tree Genet. Genomes  13: 104. Google Scholar CrossRef Search ADS   Gouzy J., Carrere S., Schiex T. ( 2009) FrameDP: sensitive peptide detection on noisy matured sequences. Bioinformatics  25: 670– 671. Google Scholar CrossRef Search ADS PubMed  Hisamoto Y., Goto S. ( 2017) Genetic control of altitudinal variation on female reproduction in Abies sachalinensis revealed by a crossing experiment. J. For. Res . 22: 195– 198. Google Scholar CrossRef Search ADS   Horton P., Park K.J., Obayashi T., Fujita N., Harada H., Adams-Collier C.J., et al.  . ( 2007) WoLF PSORT: protein localization predictor. Nucleic Acids Res . 35: W585– W587. Google Scholar CrossRef Search ADS PubMed  Ishizuka W., Goto S. ( 2012) Modeling intraspecific adaptation of Abies sachalinensis to local altitude and responses to global warming, based on a 36-year reciprocal transplant experiment. Evol. Appl . 5: 229– 244. Google Scholar CrossRef Search ADS PubMed  Ishizuka W., Ono K., Hara T., Goto S. ( 2015) Influence of low- and high-elevation plant genomes on the regulation of autumn cold acclimation in Abies sachalinensis. Front. Plant Sci.  6: 890. Google Scholar CrossRef Search ADS PubMed  Kato R. ( 1952) The vegetation of the University of Tokyo Hokkaido Forest (in Japanese with English summary). Bull. Tokyo Univ. For . 43: 1– 18. Kurahashi A., Iguchi K. ( 1993) The relation between age and first cone setting and tree size of Saghalien fir (Abies sachalinensis) planted tree (in Japanese). Trans. Meet. Hokkaido Branch Jpn. For. Soc.  41: 157– 159. Kurahashi A., Ogasawara S., Iguchi K. ( 1993) Variation in the characters of Saghalien fir (Abies sachalinensis) associated with altitudinal gradients—the growth and flower-setting at the age of nineteen years of the planted offspring families of individuals growing at various altitudes (in Japanese). Trans. Jpn. For. Soc.  104: 417– 420. Le Provost, G., Herrera, R., Paiva, J.A., Chaumeil, P., Salin, F. and Plomion, C. ( 2007) A micromethod for high throughput RNA extraction in forest trees. Biol Res. 40: 291– 297. Li B., Dewey C.N. ( 2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics  12: 323. Google Scholar CrossRef Search ADS PubMed  Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al.  . ( 2009) The sequence alignment/map format and SAMtools. Bioinformatics  25: 2078– 2079. Google Scholar CrossRef Search ADS PubMed  Li W., Godzik A. ( 2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics  22: 1658– 1659. Google Scholar CrossRef Search ADS PubMed  Martin M. ( 2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J . 17: 10. Google Scholar CrossRef Search ADS   McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., et al.  . ( 2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res . 20: 1297– 1303. Google Scholar CrossRef Search ADS PubMed  Moriya Y., Itoh M., Okuda S., Yoshizawa A.C., Kanehisa M. ( 2007) KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res . 35: W182– W185. Google Scholar CrossRef Search ADS PubMed  Nystedt B., Street N.R., Wetterbom A., Zuccolo A., Lin Y.-C., Scofield D.G., et al.  . ( 2013) The Norway spruce genome sequence and conifer genome evolution. Nature  497: 579– 584. Google Scholar CrossRef Search ADS PubMed  Parra G., Bradnam K., Korf I. ( 2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics  23: 1061– 1067. Google Scholar CrossRef Search ADS PubMed  Petersen T.N., Brunak S., von Heijne G., Nielsen H. ( 2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods  8: 785– 786. Google Scholar CrossRef Search ADS PubMed  Roschanski A.M., Fady B., Ziegenhagen B., Liepelt S. ( 2013) Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae). Appl. Plant Sci.  1: 1200179. Google Scholar CrossRef Search ADS   Schulz M.H., Zerbino D.R., Vingron M., Birney E. ( 2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics  28: 1086– 1092. Google Scholar CrossRef Search ADS PubMed  Thiel T., Michalek W., Varshney R.K., Graner A. ( 2003) Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor. Appl. Genet.  106: 411– 422. Google Scholar CrossRef Search ADS PubMed  Yamazaki T. ( 1995) Pinaceae. In Pteriodophyta and Gymnospermae . Edited by Iwatsuki K., Yamazaki T., Boufford D.E., Ohba H. pp. 266– 277. Kodansha, Tokyo. Yang Y., Smith S.A. ( 2013) Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC Genomics  14: 328. Google Scholar CrossRef Search ADS PubMed  Zerbe P., Chiang A., Yuen M., Hamberger B., Draper J.A., Britton R., et al.  . ( 2012) Bifunctional cis-abienol synthase from Abies balsamea discovered by transcriptome sequencing and its implications for diterpenoid fragrance production. J. Biol. Chem.  287: 12121– 12131. Google Scholar CrossRef Search ADS PubMed  Zerbino D.R., Birney E. ( 2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res . 18: 821– 829. Google Scholar CrossRef Search ADS PubMed  Abbreviations Abbreviations BLAST Basic Local Alignment Search Tool CEG core eukaryotic gene CEGMA core eukaryotic genes mapping approach FPKM fragments per kilobase of transcript per million mapped reads FT Flowering Locus T GO Gene Ontology KAAS KEGG Automatic Annotation Server KEGG Kyoto Encyclopedia of Genes and Genomes RAD restriction-site associated DNA RNA-Seq RNA sequencing SNV single nucleotide variant SSR simple sequence repeat TSA transcriptome shotgun assembly UTHF University of Tokyo Hokkaido Forest Footnotes Footnotes The nucleotide sequence reported in this paper has been submitted to the DDBJ DRA with accession numbers DRA004908–DRA004911 © The Author(s) 2018. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Plant and Cell PhysiologyOxford University Press

Published: Mar 15, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off