Background: Chloroplasts have their own genomes, independent from nuclear genomes, that play vital roles in growth, which is a major targeted trait for genetic improvement in Populus. Angiosperm chloroplast genomes are maternally inherited, but the chloroplast’ variation pattern of poplar at the single-base level during the transmission from mother to offspring remains unknown. Results: Here, we constructed high-quality and almost complete chloroplast genomes for three poplar clones, ‘NL895’ and its parents, ‘I69’ and ‘I45’, from the short-read datasets using multi-pass sequencing (15–16 times per clone) and ultra-high coverage (at least 8500× per clone), with the four-step strategy of Simulation–Assembly– Merging–Correction. Each of the three resulting chloroplast assemblies contained contigs covering > 99% of Populus trichocarpa chloroplast DNA as a reference. A total of 401 variant loci were identified by a hybrid strategy of genome comparison-based and mapping-based single nucleotide polymorphism calling. The genotypes of 94 variant loci were different among the three poplar clones. However, only 1 of the 94 loci was a missense mutation, which was located in the exon region of rpoC1 encoding the β’ subunit of plastid-encoded RNA polymerase. The genotype of the loci in NL895 and its female parent (I69) was different from that of its male parent (I45). Conclusions: This research provides resources for further chloroplast genomic studies of a F full-sibling family derived from a cross between I69 and I45, and will improve the application of chloroplast genomic information in modern Populus breeding programs. Keywords: Chloroplast genome, Populus, Transmission, Multi-pass sequencing, Full-sibling family Background as Populus full- or half-sibling (sib) progeny. The pat- Poplar is an important plantation tree species being terns of genetic variation in chloroplast DNAs (cpDNAs) genetically improved. The genetic improvement of traits among full-sib progeny and their parents, which can be associated with growth is a major facet of Populus identified through the comparison of whole-chloroplast breeding . Chloroplasts are important organelles that genome sequences, contain valuable information for supply energy for the growth and development of green hybrid tree breeding. The maternal inheritance of plants through photosynthesis. Chloroplast genomes cpDNAs in the genus Populus was determined by contain many genes associated with photosynthesis, such restriction fragment analysis . However, it is very diffi- as genes encoding ribulose-(1,5)-bisphosphate carboxyl- cult to study the uniparental inheritance mode of ase/oxygenase (RuBisCO)  and plastid-encoded RNA cpDNAs within a Populus full-sib family using only that polymerase (PEP) . method due to the ultra-high level of cpDNA sequence High-quality chloroplast genomes are essential for the similarities and the limited restriction enzyme sites in genome comparison of closely related organisms, such the chloroplast genome [5, 6]. The transmission of cpDNAs from the maternal parent to F offspring may be analyzed at both the whole-chloroplast genome and * Correspondence: firstname.lastname@example.org; email@example.com single-base levels, using a whole-chloroplast genome Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing comparison. In addition, the effects of a cross between Forestry University, Nanjing 210037, China © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Zhu et al. BMC Genomics (2018) 19:411 Page 2 of 15 Populus deltoides as the maternal parent and Populus × different de novo assemblers and the assemblies merged euramericana as the paternal parent is usually superior tool CISA . Whole-chloroplast genome comparisons to those of its reciprocal cross . For instance, Wang among the three clones were performed to analyze the and Huang et al. selected and obtained a few superior genetic variations occurring during the transmission clones, such as Nanlin895 (NL895) (http://www.shtree.- process of cpDNAs from mother to offspring. com/nl-895.htm) and NL95 (http://www.shtree.com/ nl-95.htm), from the F hybrids between P. deltoides Results Bartr. cv. ‘I-69/55’ (I69, ♀) and P. × euramericana Gui- Feasibility evaluation of the reference-assisted strategy nier. cv. ‘I-45/51’ (I45, ♂). The differences between the In this study, we used the hybrid strategy of both reciprocal crosses is difficult to explain using Mendelian reference-assisted and de novo assembly to isolate and inheritance modes, such as the biparental inheritance of construct a complete chloroplast genome. Evaluating the the nuclear genome, but it can be explained by the applicability of the reference-assisted strategy to the uniparental inheritance of cpDNA. However, determin- Populus cpDNAs’ assembly is required before being ing the genetic mechanism underlying the differences used. The feasibility of the reference-assisted strategy between reciprocal crosses is beneficial for parental was assessed on the basis of several aspects, including selection as an important part of Populus breeding the degree of chloroplast sequence similarity between programs. Thus, the comparison of high-quality the target and reference species, the de novo assembly of chloroplast genomes has the potential to provide chloro- short-read data simulated from the reference chloroplast plast genome-associated information for the genetic genome, and the proportion of the chloroplast reads in improvement of Populus through the use of hybrid all of the reads generated from whole-cell sequencing. breeding technology. Complete chloroplast genome construction consists of Sequence similarities of Populus cpDNAs two parts: obtaining chloroplast reads and constructing The highly conserved features of the chloroplast chloroplast genome sequences from the reads. In genomes are well-known. However, it was unclear general, to generate cpDNA reads, cpDNA sequences whether the reference-assisted method was suitable for are isolated from whole cells or extracted from constructing the Populus chloroplast genome from whole-genome shotgun reads. Because of the advances whole-genome sequencing reads, owing to the unknown in next-generation sequencing (NGS), in terms of time levels of sequence similarity among cpDNAs from and cost, and the increases in the number of available species in the genus Populus L. Thus, we performed chloroplast genomes, whole-genome shotgun sequencing pairwise comparisons among the cpDNA sequences based on NGS technology is increasingly used to con- from six Populus species, P. alba, P. tremula, P. euphra- struct plant chloroplast genomes, such as Brassica rapa tica, P. fremontii, P. balsamifera and P. trichocarpa, and Raphanus sativus [8, 9]. using the LASTZ genome alignment tool (Fig. 1). The The chloroplast genome sequences of six poplar 156–157-kbp chloroplast genomes from the six Populus species, belonging to four sections in the genus Populus, species, belonging to four Populus sections, had an over are available from the NCBI Organelle Genome 99.5% sequence identity. These Populus chloroplast Resources (http://www.ncbi.nlm.nih.gov/genome/organ- genomes shared the same quadripartite structure, in- elle/). Although poplar chloroplasts have conserved plas- cluding an 84–85 kbp large single-copy region (LSC) tid genomes of < 160 kilobase pairs (kbp) in length, the and a 16-kbp small single-copy region (SSC) separated de novo assembly of complete chloroplast genomes for by a pair of inverted repeats (IRs), IRa and IRb, with Populus species remains challenging because of a pair of sequence lengths of 27 kbp and a sequence similarity of inverted repeats, IRa and IRb, of approximately 27 kbp > 99.5%. Compared with the other five Populus species, in size and more than 99.5% sequence similarity (as the P. trichocarpa chloroplast genome had a relatively described in the Results section). This study aims to higher sequence similarity (~ 100%) between IRa and IRb. construct high-quality poplar chloroplast genomes from The complete chloroplast genomes of these species whole-genome shotgun Illumina reads using de novo within the genus Populus exhibited relatively high levels and reference-guided strategies. of sequence similarity. However, it was still unknown Here, we obtained chloroplast HiSeq 2000 reads at a whether all chloroplast reads can be extracted from total sequencing depth of > 8500× each extracted from whole-genome shotgun sequencing reads generated on an the whole-genome shotgun and multiple passes (15–16 Illumina sequencing platform using the reference-assisted times per clone) sequencing for three poplar clones, strategy. The Illumina-like and error-free reads, which including NL895 and its parents (I69 and I45). The were obtained from six Populus cpDNAs using the high-quality chloroplast genomes of the three poplar split-read method (as described in the ‘Materials and clones were constructed by a combination of eight Methods’ section), were mapped to each cpDNA of these Zhu et al. BMC Genomics (2018) 19:411 Page 3 of 15 Fig. 1 Whole-genome dot-plot comparison of six Populus cpDNAs. The chloroplast genomes of all six Populus species, P. alba, P. tremula, P. euphratica, P. fremontii, P. balsamifera and P. trichocarpa, were compared using LASTZ (v1.03.28). The aligned blocks are represented as blue lines. The blocks aligned in the reverse orientation are a pair of inverted repeats (IRa & IRb) in the chloroplast genomes. The starts and ends of the aligned blocks are labeled with transparent red points Populus species using Bowtie and BWA (Additional file 1: assembly and the ratio of chloroplast reads to Table S1). The average percentage of short reads mapped whole-genome sequencing reads are two basic to P. trichocarpa cpDNA (95%) as a reference was slightly elements for estimating the required number and cost higher than those of the other five Populus cpDNAs as of whole-genome shotgun reads. references. A total of 95.8% reads, which were generated The numbers and lengths of chloroplast reads are two from the P. fremontii (Populus Sect. Aigeiros) chloroplast key factors for constructing chloroplast genomes using genome through the split-read method, were mapped to the shotgun sequencing strategy. To assess the effects of the P. trichocarpa cpDNA. Based on the metrics of both the two factors on the chloroplast genome assembly, 12 sequence similarity and the mapping ratio, the P. tricho- short-read datasets for all combinations of the three carpa chloroplast genome was selected as a reference for different read lengths (60, 80 and 100 bp) with four dif- 4 5 6 subsequent analyses. ferent read amounts—10 k (10 ), 100 k (10 ), 1 M (10 ) and 10 M (10 ) of paired-end reads—were simulated with Assembly of the simulated reads P. trichocarpa cpDNA as a reference and used for the The cost of DNA sequencing is frequently a consider- simulation of de novo poplar chloroplast genome assembly. ation in de novo chloroplast genome assembly projects. First, we used KmerGenie to estimate the genome To calculate the effective cost of whole-genome shotgun sizes for the 12 combinations of read lengths and read sequencing, it is necessary to estimate the amounts of amounts (60 bp–10 k, 60 bp–100 k, 60 bp–1M,60bp– short reads required for the generation of a 10 M, 80 bp–10 k, 80 bp–100 k, 80 bp–1M, 80bp– high-quality assembly of the chloroplast genome. The 10 M, 100 bp–10 k, 100 bp–100 k, 100 bp–1 M and minimum amount of chloroplast reads for de novo 100 bp–10 M). The estimated values of four datasets, Zhu et al. BMC Genomics (2018) 19:411 Page 4 of 15 including 60 bp–100 k, 80 bp–100 k, 100 bp–100 k and Tukey’s multiple comparison. The running times for the 60 bp–1 M, were the nearest to the genome size of the ref- 10-k and 100-k pair-read datasets were significantly erence cpDNA (P. trichocarpa), and ranged from 129.7 kbp different from those of the 1-M and 10-M pair reads (82.6%) to 130.7 kbp (82.8%) (Additional file 1:Table S2). (Fig. 2b). In addition, the effects of the pairwise interac- The estimated size appeared equal to the total size of the tions between k-mer value, read length and read amount combined LSC, IRa and SSC regions, rather than the were not detected by two-way ANOVA. combined size of all four regions, LSC, IRa, SSC and IRb, Like the assembly time, the resulting 60 assemblies of the Populus cpDNA (157 kb). This may be due to the were not different between the before and after PCR extremely high level of sequence similarity (~ 100%) be- duplicate removal in terms of the frequently-used as- tween the IRa and IRb regions, which are 27 kbp in length. sembly metrics, such as total length, contig number, The de novo assembly of each of the 12 read datasets mean length, min length, max length, N80/L80, N50/ was performed using Minia with five k-mer values of 19, L50 and N20/L20 (N80/L80, N50/L50 and N20/L20 are 29, 39, 49 and 59. A total of sixty (12 datasets × 5 k-mers) defined by QUAST, http://quast.bioinf.spbau.ru/man- assemblies were obtained and used for estimating the ual.html). To further compare the qualities of the 60 effects of the three factors (read length, read amount assemblies on 12 simulated read datasets, we selected and k-mer value) on chloroplast genome assembly. the assemblies having total bases of > 80% the reference Time consumption and the quality of the de novo gen- genome size (157 kbp) and resulting contigs of at least ome assembly are the two most important aspects of the 200 bp in length. A total of the 13 assemblies from 6 sim- cpDNA assembly. The time required for these 60 assem- ulated read datasets, including 60 bp (read length)–100 k blies was not significantly different between before and (read pairs), 60 bp–1M,60 bp–10 M, 80 bp–100 k, after the PCR duplicate removal (Fig. 2a), because there 100 bp–10 k and 100 bp–100 k, were selected and used were < 0.1% PCR duplicates. The effects of k-mer, read for further analyses (Additional file 1: Table S3). However, length and read amount on the assembly time were a large proportion of contigs for each of the 13 assemblies tested using an analysis of variance (ANOVA). There were far < 1 kbp in length. To decrease the contig − 5 was a statistically significant difference (P = 4.58e )in numbers and increase assembly contiguity, the resulting the assembly times among all four levels of read contigs were merged into the larger contigs using the amounts (10 k, 100 k, 1 M and 10 M). To determine reference chloroplast genome as a guide. After merging which level of read amount was different from the contigs in the 13 assemblies, the number of contigs was others, pairwise comparisons of the assembly time greatly reduced, with only one contig in the three assem- among the four read amounts were performed using blies. The genome fraction rates of the 13 assemblies after ab 4000 4000 3000 3000 2000 2000 1000 1000 0 0 0 1000 2000 3000 4000 10000 100000 1000000 10000000 Time (before filter) Read_number Fig. 2 Running times for assembling the simulated short reads data. a The running times for assembling the simulated short reads data before or after filtering. The x-axis and y-axis represent the assembly times (in seconds) of the simulated short reads before and after which PCR duplicates were filtered from raw data, respectively. The times for assembling the simulated short reads data before or after filtering are indicated by the transparent blue circles. The equation ‘y = x’ is plotted as the black dotted line. b Boxplot of the running times for assembling four simulated 4 5 6 7 4 5 6 7 reads data sets (10 ,10 ,10 and 10 pairs of short reads). The assembly times for 10 ,10 ,10 and 10 pairs of simulated reads are colored as red, green, cyan and purple, respectively. Two black dashed lines in the figure represent two equations ‘y = 550’ and ‘y = 100’, respectively Time (after filter) Time Zhu et al. BMC Genomics (2018) 19:411 Page 5 of 15 contig merging ranged from 73.56 to 82.40% and were genome. The chloroplast reads were isolated from all of close to the estimated genome size as determined by the whole-genome short reads for the I69, I45 and KmerGenie. The merged contigs for these assemblies oc- NL895 clones, by aligning reads against the P. tricho- cupied the almost complete LSC, IRa and SSC regions on carpa cpDNA using BWA and Bowtie aligners. The the reference cpDNA (Fig. 3). The IRb region was missing cpDNA ratios of read datasets for the three poplar in the resulting assemblies because of the nearly 100% se- clones I45, I69 and NL895 were ~ 4.2, 7.5 and 5.0%, quence identity between IRa and IRb on the P. trichocarpa respectively (Additional file 1: Table S4). The ratio of cpDNA. The three assemblies consisting of only one con- cpDNA reads for each poplar clone was not significantly tig were assembled from the 100-k pair-read datasets with different among multiple datasets (15–16), except the 60-, 80- and 100-bp read lengths, and their genome cover- first dataset. age rates ranged from 82.20 to 82.40%. In addition, the Each of the 47 real datasets of whole-genome shotgun average number of mismatches per 100 kbp of aligned sequencing had chloroplast reads of more than 10 bases for the 13 assemblies ranged from 3.16 to 10.05. (100 k) pairs. Of these, 41 datasets contained chloroplast Based on the metrics used for assessing the chloroplast reads ranging from 100 k to 10 (1 M) pairs, which was assemblies on these 12 simulated datasets, the number similar to the amount of whole-genome shotgun reads of reads seemed to be a more important factor than read required for cpDNA assembly in the simulated reads length. The most complete three assemblies were study above. More than 1 M pairs of chloroplast reads obtained from three datasets of 100-k pair reads, and were extracted from the six short-read datasets, includ- the genome sizes of these assemblies were approxi- ing one dataset for the clone I45, four datasets for I69 mately equal to the estimated genome size. Thus, we and one dataset for NL895 (Fig. 4). The numbers of presumed that 100 k–1 M pairs may be the suitable chloroplast reads from the poplar clones I45 and NL895 number of reads required for constructing the were less than those of clone I69. Populus chloroplast genome. Chloroplast genome assembly Proportion of the chloroplast reads We employed four different parameter strategies to The ratio of cpDNA reads to whole-genome sequencing assemble a chloroplast genome from each of the 47 reads is another key factor that contributes to the esti- chloroplast read datasets, which were isolated from mation of the amount of whole-genome shotgun reads whole-genome sequencing reads from the three poplar required for the construction of the Populus chloroplast clones I45, I69 and NL895. The first strategy used Fig. 3 The fraction of P. trichocarpa chloroplast genome covered by the assemblies of the simulated short reads under multiple k-mer values. The resulting assemblies with genome fractions > 128 kbp are shown in this figure. The quadrant structure of the chloroplast genome is composed of large single-copy (LSC) and small single-copy (SSC) regions separated by a pair of inverted repeats (IRa and IRb); the regions of both IRa and IRb are labeled with the wide transparent yellow band. The x-axis and y-axis represent the genome assemblies for the simulated reads data and the locations of the reference genome covered by contigs from the genome assemblies, respectively Zhu et al. BMC Genomics (2018) 19:411 Page 6 of 15 Fig. 4 The total base count and read pairs of 48 short-read datasets. The x-axis and y-axis represent the total base count and the read pairs (read amount) of 48 short-read datasets from three poplar clones (I45, I69 and NL895), respectively. The horizontal and vertical green dashed lines 8 6 represent the equations ‘y =2×10 ’ and ‘x =1 × 10 ’, respectively multiple k-mer values (k = 19, 21, 23…63) with which The time used for cpDNA assembly chloroplast genomes were de novo assembled using four De novo short-read assembly requires intensive compu- assembly tools (ABySS, Minia, SOAPdenovo2 and tational time. The running time for de novo cpDNA Velvet). The second strategy ran the assembler SGA assembly may be affected by a number of factors, such using multiple overlap sizes (m = 41, 45, 51, 55, 61 and as data volume (total numbers or total bases of short 63). The third strategy ran the assembler IDBA using reads), assembly tools and assembly parameters. There multiple differential iterative steps (step = 2, 4, 8, 10, 20 seems to be a positive correlation between data volume and 30). The last strategy ran the SPAdes and Edena and the running time. The correlation coefficient values assemblers using default parameters. A total of 4982 as- for the total number of reads and their running times semblies were performed on all 47 datasets using the using all eight assemblers ranged from 0.70 to 0.98, and eight assembly tools with the corresponding parameter were slightly lower than those for total bases and their strategies, 4980 of which successfully obtained contigs. running times. The running times using the same as- sembler generally increased with the increasing amounts The estimated genome size of short reads used for cpDNA assembly, which seemed The cpDNA genome sizes of the three poplar clones to also be consistent with the results of the previous (I45, I69 and NL895) were estimated from the all 47 simulated read assembly. datasets based on the k-mer distribution before the de Obviously, the assembly times for the same datasets novo assembly of the poplar chloroplast genome. The were different among the eight assemblers. For example, genome size estimates of all 47 datasets were very close the assembly times using SPAdes of all 47 read datasets to 129 kbp, which was approximately equal to the total were more than those using Edena. In comparison with bases of the LSC, IRa and SSC regions of the Populus the assemblies of SPAdes and Edena that were per- chloroplast genome. This was in accordance with the formed under the default parameters only, the assem- results of the previously performed simulated reads ana- blies of SGA, IDBA, ABySS, Minia, SOAPdenovo and lysis. Nevertheless, the estimated values of read dataset Velvet on the same dataset were performed using several (Set) 1.1 of both clones I45 and NL895 were far less than customized parameters. The running times of assemblies 129 kbp. This may be because the read amounts of these generated by the six assemblers may be influenced not two datasets were greater than 1 M (10 ) pairs, which only by read volume and assembly tools, but also by dif- was far from the number of reads required for the ferent parameters. To determine the effects of both read poplar chloroplast genome’s assembly. volume and parameters on the running time, we used a Zhu et al. BMC Genomics (2018) 19:411 Page 7 of 15 multivariate ANOVA to analyze each of the assemblies high values of > 0.91. totalSum and CoverRatio had very generated by the six assemblers. The running times of low correlations with the other nine metrics. Thus, four the six assemblers (p < 0.001), except SGA (p = 0.07), metrics, N50, num, totalSum and CoverRatio, were used were significantly different for multiple values of the to select the optimum assembly from each assembler corresponding parameters. The running times of the using each dataset under differential parameters, such as IDBA assemblies on the same read dataset increased in k-mer, overlap and iterative step. For assemblies from general with the decreasing ‘step’ parameter as a result the same assembler using the same datasets, there were of the increased number of iterations in an assembly no differences in the num metric among the different when the ‘step’ parameter was set to the smallest value. parameters. For the assemblies of the IDBA, Minia, The assembly times of the four k-mer-based assemblers SOAPdenovo, SGA and Velvet five assemblers, no (ABySS, Minia, SOAPdenovo and Velvet) increased significant differences were shown in the CoverRatio and broadly with the decreasing k-mer values. Thus, the totalSum among the different parameters. However, the running times of short-read assemblies for constructing CoverRatio and totalSum of the ABySS assemblies with chloroplast genomes were mainly under the significant certain k-mers were relatively higher than those of the influence of the three factors, including data (reads) other k-mers. Thus, the selection criteria for the optimum volume, assembly tools and assembly parameters. Add- assemblies for the ABySS, IDBA, Minia, SOAPdenovo, itionally, the running times for the SGA were longer SGA and Velvet assemblers were as follows: than those of the other seven assemblers, based on the results from multiple comparison tests. (a). Of the assemblies generated by IDBA, Minia, SOAPdenovo, SGA and Velvet assemblers, the The initial de novo assembly assemblies with the greatest N50 values were From the 4980 resulting assemblies, we screened 2200 selected from the assemblies using the same dataset assemblies having total base lengths of > 126 kbp and under different parameters as the optimum cpDNA total contig numbers of < 100, which were selected as a assembly. preliminary selection criteria on the basis of the previous (b).For the ABySS assemblies, the assemblies with the simulated reads analysis. The number of read datasets greatest N50 contig sizes were selected as the that were assembled and met the selection criteria are optimum when the CoverRatio for the same dataset summarized in Table 1. ABySS, IDBA and SPAdes were was less than 85%. Otherwise, we chose the the only three assembly tools that successfully assembled assembly with the greatest N50 from ABySS the 47 datasets into eligible assemblies. SGA was the assemblies with CoverRatios of > 85% as the least successful assembler, producing eligible assemblies optimum assembly. from only 2 of the 47 datasets. To effectively select the optimum assembly from each Based on the selection criteria, we obtained 235 assembler using each dataset, we analyzed the relation- optimum assemblies, 2 assemblies from SGA, 9 from ships among 11 assembly metrics, including the Edena, 15 from SOAPdenovo, 25 from Velvet, 43 from numbers of contigs (num), the total bases of contigs Minia and 47 from the three assemblers ABySS, IDBA (totalSum), the mean lengths of contigs (mean), the max and SPAdes. To further analyze these assemblies, we lengths of contigs (max), N80 s and L80 s of contigs compared the N50, num, totalSum and CoverRatio met- (N80 and L80), N50 s and L50 s of contigs (N50 and rics. The N50 values of the Edena, Minia and Velvet L50), N20 s and L20 s of contigs (N20 and L20), and the assemblies were far less than those of the other five ratio of the reference chloroplast genome covered by the tools. The num (contig number) values of the Edena, aligned contigs (CoverRatio). The relationships among Minia and Velvet assemblies were relatively high at the 11 metrics are shown in Fig. 5. The correlation greater than 42 compared with those of the other five coefficients between N50 and four metrics, N80, N20, assemblers (less than 25). The totalSum (total base) max and mean, were at least 0.74. The coefficients be- values of the Velvet assemblies were much greater than tween num and three metrics, L80, L50 and L20, had those of the other tools, being at least 225 kbp, which Table 1 The number of read datasets successfully assembled by each assembly tool Clone Assembly tool ABySS Edena IDBA Minia SGA SOAPdenovo SPAdes Velvet I45 16 5 16 16 0 6 16 8 I69 15 1 15 12 0 3 15 7 NL895 16 3 16 15 2 6 16 6 Zhu et al. BMC Genomics (2018) 19:411 Page 8 of 15 Fig. 5 The correlations among 11 metrics for a genome assembly assessment. This figure, which was plotted using the R package ‘corrplot’, represents the correlation matrix of the 11 metrics used for assessing poplar chloroplast genome assembly. The 11 metrics in the figure were ‘num’ (number of contigs), ‘totalSum’ (the total base of contigs), ‘mean’ (the mean length of contigs), ‘max’ (the max length of contigs), ‘CoverRatio’ (the ratio of reference chloroplast genome covered by the aligned contigs), N80, L80, N50, L50, N20 and L20. N80, L80, N50, L50, N20 and L20 are defined by QUAST greatly deviated from the 157 kbp of the Populus refer- for each dataset. To further improve the quality levels of ence chloroplast genome. The CoverRatio values of the the cpDNA assemblies from the three poplar clones, we SGA and SOAPdenovo assemblies were less than 80.6%. adopted the strategy of merging multiple assemblies into Thus, the Edena, Minia, SGA, SOAPdenovo and Velvet one to reduce errors and to extend contig lengths. Given assemblies were inferior to those of the remaining three the preferable contiguity of the assemblies generated by assemblers, ABySS, IDBA and SPAdes. Thus, the ABySS, SPAdes relative to those of the other two assemblers, the IDBA and SPAdes assemblies were used for the follow- SPAdes assemblies were selected as input for merging ing improved assembly. the initial contigs. The contigs from all of the SPAdes assemblies for each poplar clone were merged into large The improved assembly contigs using the genomic assemblies merging tool The IR sequence was easily lost in the de novo assembly CISA(v1.3) after the removal of misassembled contigs of the entire Populus chloroplast genome owing to the identified by QUAST (v3.1). By merging these contigs high similarity between IRa and IRb. The genome size based on the MUMmer (v3.23) pairwise alignment, estimates of the simulated data and real data in the the large contigs were merged into three super study were approximately 129 kbp, covering the LSC + contigs of 128,701, 128,864 and 128,786 bp for the IRa + SSC regions, and almost equal to those of the de three clones I45, I69 and NL895, respectively. All novo assemblies. Thus, we divided the entire chloroplast three super contigs were aligned to 1–129,435 bp of assembly into two parts, the LSC + IRa + SSC and the the P. trichocarpa chloroplast genome, fully covering IRb regions. the LSC + IRa + SSC region. However, there were no significant differences among The two chloroplast IRs, which are long equal blocks the SPAdes, IDBA and ABySS assemblies of each dataset of ultra-high sequence similarity, posed a challenge for for the I45, I69 and NL895 poplar clones. Thus, it is very constructing the high-quality assembly of chloroplast difficult to select the optimum assembly from those genome sequences. The acquisition of the contigs span- generated by the SPAdes, IDBA and ABySS assemblers ning the IRb region was undertaken in the three steps. Zhu et al. BMC Genomics (2018) 19:411 Page 9 of 15 Frist, contigs that overlapped with IRb were selected Comparing whole chloroplast genome sequences from all of the assemblies generated by the SPAdes, The chloroplast genome sequences of three poplar IDBA and ABySS assemblers on 15 or 16 datasets per clones, I45, I69 and NL895, were compared separately poplar clone. Next, these selected contigs were integrated with those of P. trichocarpa using ‘nucmer’ and into the contigs completely containing or partly overlap- ‘show-snps’ in the MUMmer package. A total of 407 ping the IRb region, on the basis of its genomic compari- variants were identified in the chloroplast genomes of son with the reference cpDNA. We obtained three I45, I69 and NL895 compared with the reference IRb-overlapped contigs of 30,817, 30,819 and 41,369 bp cpDNA. These variants consisted of 211 SNPs and 196 for the I45, I69 and NL895 clones, respectively. InDels. The vast majority of SNP loci (181) were located The contigs covering the LSC + IRa + SSC region and in the LSC and SSC regions of the cpDNAs, but only 30 overlapping the IRb region were further merged into SNPs occurred in the IRa and IRb regions. Similarly, three super contigs of 145,764, 145,929 and 156,380 bp only six deletions and six insertions were situated in the for the I45, I69 and NL895 clones, respectively. The IR pair in the chloroplast genome. three resulting super contigs for I45, I69 and NL895 Nearly three-quarters of all variants (310) identified represented regions of 1–146,502, 1–146,504 and 1– in the poplar chloroplast genome were the same 157,033 bp, respectively, on the P. trichocarpa chloro- across all three poplar clones. The variants consisted plast genome. of 180SNPsand 130InDels, including67insertions Nevertheless, an approximate 11 kbp of partial and 63 deletions. The 180 SNPs were divided into sequences on the IRb region was lost in the chloroplast 150 loci in the LSC and SSC regions and 30 loci in assemblies for both I45 and I69 but was present in those the IRa and IRb regions. Only 12 of the 130 InDels of NL895. IRa and IRb on NL895 were a pair of reverse were consistent among the three poplar clones and complementary sequences of 27,649 bp in length and located in the pair of IR regions of the cpDNAs. nearly 100% sequence identity, like the IRa and IRb from The remaining one-quarter of variants (97) in the pop- the P. trichocarpa cpDNA. The IRa sequence for NL895 lar chloroplast genome were not completely consistent was the same as that of both I45 and I69. In addition, among the I45, I69 and NL895 clones. All of the variants the partial sequence of the IRb region in the I45 and I69 that were inconsistent among I45, I69 and NL895 were cpDNA was the complete reverse complement of its located in the LSC (93) and SSC (4) regions, but not in counterpart in the IRa region. Because the sequences of the IRa and IRb regions. The 93 inconsistent variants sit- both IRa and IRb on the Populus chloroplast genome uated in the LSC region were composed of 30 SNPs, 17 formed a pair of inverted repeats that were extremely insertions and 46 deletions. All of the inconsistent vari- similar or even identical, it was possible that the IRa and ants were divided into three groups, including 40 vari- IRb sequences of the clones I45 and I69 were a pair of ants that were the same between the two poplar clones identical reverse complementary repeats, like those of I45 and I69 but not to NL895 (marked as I45 = I69), 44 NL895 and P. trichocarpa cpDNA. Using that assump- variants that were consistent between I45 and NL895 tion, the partial sequence lost in the IRb region was (I45 = NL895) and 13 variants that were consistent inferred from the complete IRa sequence of I45 and I69. between I69 and NL895 (I69 = NL895). These results Final, we obtained complete chloroplast genome seem to suggest that the differences between I69 and I45 sequences of 156,295 and 156,458 bp for the clones I45 chloroplast genomes were less than those between I69 and I69. and NL895. This could be because P. × euramericana cv. I45 was derived from the cross between P. deltoides Comparison of the chloroplast genomes as the maternal parent and P. nigra as the paternal par- We compared the chloroplast genomes among the three ent. Thus, P. × euramericana cv. I45 obtained a P. del- clones (I45, I69 and NL895) in two different ways, toides chloroplast genome, which was very similar to the comparing their genome sequences and aligning the chloroplast genome of P. deltoides cv. I69. reads to reference cpDNAs of P. trichocarpa.A whole-genome sequence comparison is the most direct Comparing the SNPs obtained by mapping way to identify differences in sequences between individ- The SNPs for each dataset of the three poplar clones uals or genotypes. However, it is very difficult to discover I45, I69 and NL895 were called by read alignment to the heterozygous loci within an individual using only the P. trichocarpa chloroplast genome using BWA and sequence comparison. Read mapping to call single SAMtools. We used the read mapping strategy to iden- nucleotide polymorphisms (SNPs) and insertions/dele- tify 311 variant loci, 71 of which were problematic and tions (InDels) complements the whole-genome sequence discarded. Of these, 68 (> 95.7%) had poor repeatability comparison in discovering differences among the chloro- within the same poplar clone. The other three problem- plast genomes of individuals. atic variants were inconsistent loci at which the Zhu et al. BMC Genomics (2018) 19:411 Page 10 of 15 predicted reference alleles of the three poplar clones genome. Only 65 loci with identical genotypes, 30 silent were different. These problematic variants were mainly (synonymous) and 35 missense mutations, were located detected in or near the cpDNA locations containing in the coding regions or exons of the genes. For low-complexity sequences and simple sequence repeats. example, nine of these loci were located in the exon of The vast majority of these problematic loci were consist- the RNA polymerase (rpo) gene family, consisting of ent with the results obtained by the genome comparison rpoA, rpoB, rpoC1 and rpoC2, in the poplar chloroplast strategy as described previously. Thus, the genome genome. The nine loci included a missense mutation in comparison strategy could be used to correct the prob- the rpoA gene, four silent (synonymous) mutations in lematic variants that were identified through the read rpoB, two silent (synonymous) mutations in rpoC1, and mapping strategy. In addition, the other 240 variants two missense mutations in rpoC2.The α, β, β’ and β” identified by the read mapping strategy were retained for subunits of PEP, which is a key enzyme involved in plant further analysis. photosynthesis, were encoded by four genes, rpoA, rpoB, rpoC1 and rpoC2 . The vast majority of the other 94 Differences in the variants among the three clones variations with differences in genotypes between the To discover accurately the differences in the chloroplast three poplar clones were located in the non-coding genome sequences among the poplar clones I-45, I-69 region of the poplar chloroplast genome. Nevertheless, and NL895, we integrated 407 and 240 variants identi- only one variant was situated in the coding region of the fied by the chloroplast genome comparison and read poplar cpDNA, leading to changes in the amino acid mapping strategies, respectively, and obtained 401 sequence of the PEP β’ subunit encoded by rpoC1.The variants consisting of 213 SNPs, 108 deletions and 80 in- identified genetic variants on exons of rpoA, rpoB, rpoC1 sertions. In total, 299 variant loci, approximately and rpoC2 may have an impact on the transcriptional three-quarters of all the loci, were located in the LSC level of photosynthesis-related genes, which are medi- region of the poplar chloroplast genome. In addition, the ated by PEP . The deletion of each of rpoA, rpoB, problematic variants, which had been mostly identified rpoC1 and rpoC2 can cause defects in plant photosyn- using the read mapping strategy, were in low-complexity thesis [15, 16]. However, no photosynthetic defects were regions, large InDel regions of > 5 bp in length and short found in the three poplar clones (I45, I69 and NL895). tandem repeat regions (especially the homopolymers Thus, the identified genetic variants on exons of rpoA, polyA or polyT). Six heterogeneous variants, which rpoB, rpoC1 and rpoC2 of I45, I69 and NL895 should could not be detected by the genome comparison strat- not cause loss or gain of these rpo genes functions. egy, were improperly identified using the mapping strat- The genotypes of the loci identified by genome egy. Similarly, single-nucleotide variant heteroplasmy sequence comparison strategy were consistent with had been found in the mitochondrion of mice and hu- those identified by the read mapping strategy. The geno- man, using Single-Mitochondrion sequencing . types of the loci of the poplar clone NL895 were identi- All 401 variants were divided into two groups cal to those of its maternal parental clone, I69, and (Additional file 1: Table S5), including 307 variant loci in different from those of its male parental clone, I45. which the genotypes of the three poplar clones (NL895, Thus, the offspring cpDNAs were derived from the I-45 and I69) were identical. The remaining 94 loci with female parent in species of the genus Populus. genotypes were not identical among these three clones. Of the 307 identical variants, 42 were situated in the IR Discussion pair (IRa and IRb) of the poplar chloroplast genome. Chloroplast genome construction None of the variant loci with different genotypes were In the study, we constructed high-quality chloroplast located in the IRa and IRb regions. Thus, the pair of genomes for three poplar clones, the superior clone long interspersed IRs (IRa and IRb) was more conserva- NL895 and its female (I69) and male (I45) parents, from tive than the other two single-copy regions (LSC and 15 to 16× whole-genome shotgun sequences per clone SSC), which was consistent with the results found in on the HiSeq 2000 platform using a combination of de rice, maize and bamboo [12–14]. This may be due in novo and reference-assisted strategies. A high-quality part to the biased GC content. The overall GC content assembly of the chloroplast genome is essential for com- of the IRa and IRb (41.97%) regions in the poplar paring chloroplast genomes between closely related cpDNAs was more than those of the LSC (34.47%) and species or individuals, especially for genetic relationships SSC (30.54%) regions. between parent and offspring or between full-sibs. How- Of the 307 variant loci having the same genotype ever, chloroplast genome assemblies are challenging due among the three poplar clones, most were located in the to influences of both sequencing design and assembly non-coding regions, such as introns, and upstream and strategies and the degree of kinship between the refer- downstream sequences, of the poplar chloroplast ence and species to be assembled. Zhu et al. BMC Genomics (2018) 19:411 Page 11 of 15 Sequencing design with the reference cpDNA is usually the lowest peak and The length, number and distribution of short reads used an obstacle for the complete cpDNA assembly. In for assembling cpDNA are the three primary features of contrast, the PacBio sequencing platform is able to sequencing design. The length of the reads generated on generate a uniform distribution of reads across the entire Illumina sequencing platforms appeared to not affect the genome . assembly of the chloroplast genomes for the three poplar Although the PacBio sequencing platform has many clones, and this was supported by the resulting assem- advantages, such as longer reads and uniform coverage blies of both the simulated Illumina-like reads and the distributions, second-generation sequencing platforms real Illumina reads. However, this only indicates that (such as Illumina platforms) remain proper platforms for there are no significant differences in the de novo as- de novo cpDNA assembly owing to the accuracy, ex- sembly among Illumina reads ranging from 60 to 120 bp tremely low single-nucleotide cost, high reproducibility in length. Compared with the short reads of the Illumina and high-throughput of reads from these platforms . sequencing platform, the longer reads of the third gener- Additionally, multi-pass sequencing, to some extent, can ation sequencing platform (such as PacBio long reads remedy the deficiency of short Illumina reads. In having a mean length of > 10 kb, http://www.pacb.com/) particular, when high-quality cpDNA of closely-related have the potential to resolve long repeats in the chloro- species are available, the Illumina sequencing platform is plast genome and make it relatively easy to de novo a relatively good choice for simultaneously assembling assemble the chloroplast genome. For example, the cpDNA of multiple samples from organisms that are complete chloroplast genome of pineapple (Ananas closely related (such as parent–offspring and full-sibs) comosus) containing a pair of IR regions (IRa and IRb) and identifying mutations in their cpDNAs. 26.7 kbp in size has been constructed using PacBio long reads . However, a high error rate (~ 15%) and rela- Assembly strategy tively high cost per base pair are two shortcoming of De novo assembly and contig merging are the two main PacBio long reads compared with Illumina short reads steps for cpDNA construction in this study. Eight de . Thus, the hybrid assembly of Illumina short reads novo assemblers, Velvet, SOAPdenovo2, ABySS, Minia, with their high accuracy and PacBio long reads with Edena, SGA, IDBA and SPAdes, were used for the de their low accuracy may be a promising strategy for novo-assembly of all 47 datasets from the three poplar constructing high-quality chloroplast genomes. clones (NL895 and its parents I45 and I69). Compared The number of reads used for a single independent as- with the other five assemblers, ABySS, IDBA and sembly of a chloroplast genome is a factor that causes SPAdes provided relatively superior results in terms of differences in the resulting assemblies. The assemblies the metrics used in the assessment of chloroplast based on simulated and real datasets in this study did genome assemblies. The resulting assemblies generated not suggest that more reads result in superior assem- using the other five assemblers had some deficiencies, blies. A million pairs of chloroplast reads (10 )isa such a large numbers of fairly short contigs. In addition, suitable amount for the de novo assembly of poplar the parameter k-mer specified in the four assemblers, cpDNAs. However, the suitable number of whole-genome ABySS, Minia, SOAPdenovo2 and Velvet, influenced the shotgun sequencing reads for assembling cpDNAs de novo chloroplast assemblies. depends on the ratio of chloroplast reads in the It is necessary to order and orient the de novo-assembled whole-genome sequencing data as well as the number of contigs. We applied the contig integrator CISA to merge de chloroplast reads to be assembled. There are differences novo assemblies of multiple datasets for each of the clones. in the chloroplast ratios of whole-genome sequencing The quality levels of these cpDNAs were slightly improved, reads among organisms or clones. The cpDNA ratios of particularly in the accuracy and continuity of the resulting the three poplar clones, NL895, I45 and I69, were assemblies. 5.51, 4.99 and 8.82%, respectively. The cpDNA ratios of the Sanger sequencing data for two rice cultivars Kinship (PA64S and 93–11) are less than 2.3% . Thus, it High-quality chloroplast genomes are very important for is necessary to determine the whole-genome reads comparing chloroplast genomes and accurately identify- amount to estimate the cpDNA ratio in advance of ing mutations in cpDNAs, specifically for parents–off- whole-genome shotgun sequencing. spring or F full-sib progeny comparisons. The genetic The uneven coverage distribution of reads is an inher- distance between the target genome to be assembled ent defect of second-generation sequencing platforms and its reference genome is an important factor for the (such as Illumina sequencing platforms), just like the de novo assembly of chloroplast reads isolated from distribution of Illumina HiSeq reads in this study . whole-genome sequencing data. For example, it is diffi- The read depths of gaps or missing regions compared cult to capture cpDNA reads in regions that are inserted Zhu et al. BMC Genomics (2018) 19:411 Page 12 of 15 or deleted in the target cpDNA compared with in the generated here. Additionally, we will perform association reference chloroplast genome. In this study, we chose analyses of poplar chloroplast genotypes in combination the P. trichocarpa cpDNA as a reference to extract almost with 24-year growth-associated phenotypic data for two all of the chloroplast reads from the whole-genome parents and 64 of their offspring. shotgun reads of the three clones. Nevertheless, the construction of the chloroplast genomes for the three Methods poplar clones revealed that it was still difficult to construct Chloroplast genome resources and computing high-quality poplar chloroplast genomes using only the environment isolated cpDNA reads from single-pass sequencing data. The complete chloroplast genome sequences of six This may be due in part to the 27,649-bp IRa and IRb poplar species, P. alba (GenBank accession no. sequences with nearly 100% sequence identity. Finally, we NC_008235), P. tremula (NC_027425), P. euphratica obtained high-quality and nearly-complete chloroplast (NC_024747), P. fremontii (NC_024734), P. balsamifera genome sequences for the three poplar clones by merging (NC_024735) and P. trichocarpa (NC_009143), were contigs of multiple assemblies per poplar clone. downloaded from Organelle Genome Resources at NCBI (http://www.ncbi.nlm.nih.gov/genome/organelle/). The Chloroplast variations in the transmission from one six poplar species belong to four sections of the genus generation to the next Populus, sect. Populus (P. alba and P. tremula), sect. In the study, we used an entire chloroplast genome com- Turanga (P. euphratica), sect. Aigeiros (P. fremontii) and parison to preliminarily determine the pattern of the sect. Tacamahaca (P. balsamifera and P. trichocarpa). spontaneous mutations in the chloroplast genome of a The analysis and mining of large-scale sequence poplar hybrid F generation. During the transmission of datasets, such as read alignment, SNP calling and se- poplar chloroplasts from female parent to offspring, the quence assembly, was conducted on a DELL PowerEdge chloroplast’s mutations were mainly located in the R910 server with 512 GB RAM (32 × 16 GB). Ubuntu is non-coding and single-copy regions. This may help to the Linux operating system on the R910 server. explain differences in breeding effects between a cross and its reciprocal cross. These results contribute import- Pre-assessing reference-assisted strategy ant auxiliary information for forestry tree breeding, such To select the right reference from the six poplar as parental selection and the prediction of potential cpDNAs, two steps were performed. Firstly, to estimate mutations based on chloroplast genomes of female and the sequence similarities between these cpDNAs, pair- male parents. However, spontaneous mutations in wise genome alignments were conducted using LASTZ full-sib families have rarely been reported, such as in (v1.03.28, http://www.bx.psu.edu/~rsharris/lastz/) with Drosophila melanogaster . Thus, the identification of at least a 95% identity. Secondly, to infer the efficiency spontaneous mutations in a full-sib family using whole of each poplar cpDNA as a reference genome, the num- chloroplast genome comparisons is a prerequisite for ber of the identified cpDNA reads was calculated based breeding programs. on the alignment of the simulated short reads against Additionally, we identified the female parent of a each poplar cpDNA. The split-reads method, which split natural hybrid clone as I45 by comparing cpDNAs of the reference sequence into many 100-bp fragments with I45 and P. deltoides clone I69, which suggests that a a 1-bp sliding window, was used to generate simulated whole chloroplast genome comparison can be used to short reads from these poplar cpDNAs. Mapping the determine the female origins of varieties and clones. short reads against cpDNAs for each of the six Populus Information on the parental origins of hybrid varieties species was performed using BWA (v0.7.12) and Bowtie are important for poplar evolutionary studies and breed- (v1.1.1) with default parameters [23, 24]. ing programs. To evaluate the feasibility of the reference-assisted strategy for the poplar chloroplast genome assembly, we Future studies utilized the reads simulator wgsim (v0.3.0, https:// While this research is insufficient to completely illustrate github.com/lh3/wgsim) to simulate and generate that the law of genetic variation occurred in the chloro- Illumina-like paired-end reads in an attempt to assemble plast genomes within a Populus full-sib family, it builds poplar cpDNA from Illumina-like reads. Based on the a framework for studying and understanding the above findings, the P. trichocarpa chloroplast genome relationships of spontaneous Populus chloroplast muta- (NC_009143) was chosen as a reference sequence for tions with its growth traits. Thus, in future studies, we the wgsim simulator. The duplicates of the simulated will construct the chloroplast genomes of another 64 paired reads were identified and discarded with FastUniq progeny from the same full-sib family using the same (v1.1) . The genome size for the Illumina-like reads strategy and the three high-quality cpDNA sequences generated by wgsim was predicted using KmerGenie Zhu et al. BMC Genomics (2018) 19:411 Page 13 of 15 (v1.6982) . The simulated reads were de novo assem- each dataset from the three clones using eight de bled into the cpDNA by Minia (v2.0.3) . These Minia novo genome assemblers, Velvet (v1.2.10), SOAPde- assemblies were assessed using the assembly evaluation novo2 (v2.04), ABySS (v1.9.0), Minia (v2.0.3), Edena tool QAUST (v3.1) with P. trichocarpa cpDNA as a (v3.131028), SGA (v0.10.13), IDBA (v1.1.1) and reference genome . SPAdes (v3.6.0) [27, 29–35]. These eight de novo as- semblers were grouped into four parameter-based Sampling and sequencing types: (I) ABySS, Minia, SOAPdenovo2 and Velvet The Populus materials used in the study were harvested were used with 23 odd k-mer values from 19 to 63 from the Zhangji Forestry Centre at Xuzhou, Jiangsu for poplar cpDNA assembly; (II) SGA was applied Provinces, China. DNA samples were taken from the with six various overlapping sizes (−m) of 41, 45 51, fresh green leaves of three Populus sect. Aigeiros clones, 55, 61 and 63; (III) IDBA was utilized with six differ- I69 (P. deltoides cv. ‘I-69/55’), I45 (P. × euramericana ent iterative step values (−step) of 2, 4, 8, 10, 20 and cv. ‘I-45/51’), and NL895, which is a genotype from the 30; (IV) Edena and SPAdes were run under their F progeny of the interspecific hybrids between I69 respective default parameters only. A total of 4982 de (female parent) and I45 (male parent). Total DNA was novo assemblies were performed on all 47 read data- extracted from the fresh leaves for each Populus clone sets from the three poplar clones in the study. In using a DNeasy Plant Mini Kit (QIAGEN, Hilden, addition, the running-time for each assembly was re- Germany, Cat No.69104) according to the manufac- corded by an in-house Perl script. turer’s instructions. The quantification and purity of the Subsequently, the resulting assemblies for the 47 data- extracted DNA was assessed using agarose gel electro- sets were assessed using QUAST with the P. trichocarpa phoresis and a NanoDrop 2000 spectrophotometer. The cpDNA (NC_009143) as a reference . A visual extracted DNA was used for paired-end sequencing on inspection of the cpDNA assemblies was conducted on the Illumina HiSeq 2000. DNA sequencing was carried the integrative genomics viewer IGV (v2.3.67) . out 15–16 times for each clone. The whole-genome To order and orient the contigs assembled from shotgun sequencing of the three Populus clones was multiple datasets for each clone, we utilized the contig performed at the Chinese National Human Genome integrator CISA (v1.3) to merge contigs from various as- Center at Shanghai (http://www.chgc.sh.cn/). semblies for the same clone, and these were manually curated into a contig covering the LSC + IRa + SSC Extracting cpDNA reads region of the chloroplast genome . Then, the de A quality assessment of the raw short reads generated novo assembled contigs over the IRb region were on Illumina HiSeq platform were carried out using screened based on the QUAST assessments and merged FastQC (v0.11.5, http://www.bioinformatics.babraham.a- into a contig by CISA. Contigs spanning both the LSC + c.uk/projects/fastqc/), and quality control measures were IRa + SSC and IRb regions were manually merged into performed with the FASTX-Toolkit (v0.0.14, http://han- contigs based on the reference cpDNA. nonlab.cshl.edu/fastx_toolkit/). The low-quality ends The quality of the constructed cpDNAs for the three were trimmed from the raw short reads using the clones was assessed by QUAST with the P. trichocarpa FASTX-Toolkit. The trimmed reads were used for cpDNA as the reference. The depth of reads across the further analyses. entire cpDNA per dataset was calculated from the BAM The cpDNA reads were isolated from whole-genome format using ‘genomecov’ in the bedtools package shotgun read data by mapping the trimmed reads to the (v2.25.0) . Gaps in assemblies for the same clone reference chloroplast genome. The trimmed reads were were combined by ‘merge’ in the bedtools package. mapped onto the P. trichocarpa chloroplast genome (NC_009143) using Bowtie and BWA. The paired-end Comparing chloroplast genomes reads mapped onto the P. trichocarpa chloroplast The constructed chloroplast genomes for the three genome using both Bowtie and BWA were screened and poplar clones were annotated with DOGMA (http://dog- integrated as the cpDNA reads for the chloroplast gen- ma.ccbb.utexas.edu/) . BLAST algorithm-based ome assembly of the three poplar clones. The duplicates searches against annotated genes from six poplar refer- were identified from the paired-end reads of poplar ence cpDNAs were performed to detect potential genes chloroplast using FastUniq (v1.1)  and discarded. in the cpDNAs of the three clones. tRNA genes in these cpDNAs were identified by tRNAScan-SE (v1.21, http:// Constructing chloroplast genomes lowelab.ucsc.edu/tRNAscan-SE/). Frist, a de novo assembly strategy was employed to make The chloroplast genomes for the three poplar clones, the initial assembly of the selected cpDNA reads. The including NL895 and its parents I69 and I45, were cpDNA reads were assembled into the initial contigs for pairwise compared using MUMmer (v3.23, http:// Zhu et al. BMC Genomics (2018) 19:411 Page 14 of 15 mummer.sourceforge.net/) . Four regions, LSC, IRa, Availability of data and materials The short reads generated in the study are available in the NCBI Sequence SSC and IRb, were identified by ‘repeat-mach’ in the Read Archive (SRA) under BioProject accession PRJNA430966. MUMmer package. The reads per dataset were aligned to the correspond- Authors’ contributions ing cpDNA assembled in the study using BWA (v0.7.12). MRH conceived the project. SZ designed the experiments. MX, HRW, HXP and GPW performed sample collection and DNA extraction. SZ performed SNPs and InDels (Insertions and Deletions) were the bioinformatics analysis and statistical tests. SZ wrote the manuscript. All detected directly from each of the 47 datasets by SAM- authors read and approved the final manuscript. tools (v1.2) and BCFtools (v1.2) [41, 42]. The annotation of the called variant loci was performed using SnpEff Ethics approval and consent to participate All poplar materials used in the study were the registered cultivars in with the P. trichocarpa cpDNA as the reference. International Poplar Commission of the Food and Agriculture Organization (IPC-FAO, http://www.fao.org/forestry/ipc/en/). As researchers of Nanjing Forestry University, we are allowed to use these poplar materials for research Conclusions in China. Sampling of these poplar materials were performed in compliance In this study, the high-quality and complete chloro- with institutional, national, and international guidelines. plast genomes for three poplar clones, the superior clone NL895 and its both parents I69 (P. deltoides Competing interests Bartr. cv. ‘I-69/55’, ♀)and I45(P. × euramericana The authors declare that they have no competing interests. Guinier. cv. ‘I-45/51’, ♂), were constructed using ultra-depth coverage (>8500×) of chloroplast genome Publisher’sNote per clone. The three high-quality chloroplast genomes Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. can serve as an important resource for further study- ing the chloroplast variation pattern within full-sib Received: 15 August 2017 Accepted: 22 May 2018 family of I69 and I45. Furthermore, the chloroplast spontaneous mutations between parents and offspring References provide a potential application of cpDNA information 1. Du Q, Gong C, Wang Q, Zhou D, Yang H, Pan W, Li B, Zhang D. Genetic in Populus breeding via molecular design. architecture of growth traits in Populus revealed by integrated quantitative trait locus (QTL) analysis and association studies. New Phytol. 2016;209(3): 1067–82. Additional file 2. Spreitzer RJ, Salvucci ME. Rubisco: structure, regulatory interactions, and possibilities for a better enzyme. Annu Rev Plant Biol. 2002;53:449–75. Additional file 1: Table S1. The ratio of split-reads mapped to each 3. Kremnev D, Strand A. Plastid encoded RNA polymerase activity and cpDNA of six Populus species using Bowtie and BWA. Table S2. The gen- expression of photosynthesis genes required for embryo and seed ome size of P. trichocarpa cpDNA was estimated using each of twelve development in Arabidopsis. Front Plant Sci. 2014;5:385. sets of simulated reads (60 bp–10 k, 60 bp–100 k, 60 bp–1M,60bp– 4. Rajora OP, Dancik BP. Chloroplast DNA inheritance in Populus. Theor Appl 10 M, 80 bp–10 k, 80 bp–100 k, 80 bp–1M,80bp–10 M, 100 bp–10 k, Genet. 1992;84(3–4):280–5. 100 bp–100 k, 100 bp–1 M and 100 bp–10 M). Table S3. Basic statistics 5. Nock CJ, Waters DL, Edwards MA, Bowen SG, Rice N, Cordeiro GM, Henry RJ. of 13 assemblies used for further analyses. Table S4. The cpDNA and Chloroplast genome sequences from total DNA for plant identification. gDNA ratios of read datasets for the three poplar clones I45, I69 and Plant Biotechnol J. 2011;9(3):328–33. NL895. Table S5. The positions, reference and alternative base, types 6. Matsuoka Y, Yamazaki Y, Ogihara Y, Tsunewaki K. Whole chloroplast genome (Replace|Insert|Delete) and the genotypes of all 401 variants identified in comparison of rice, maize, and wheat: implications for chloroplast gene the three poplar clones. (XLSX 31 kb) diversification and phylogeny of cereals. Mol Biol Evol. 2002;19(12):2084–91. 7. Zhang Q. Populus genetic improvement within Aigeiros. Scientia Silvae Sinicae. 1987;23(2):174–81. Abbreviations 8. Jeong YM, Chung WH, Mun JH, Kim N, Yu HJ. De novo assembly and ANOVA: Analysis of variance; cpDNAs: Chloroplast DNAs; InDels: Insertions/ characterization of the complete chloroplast genome of radish (Raphanus Deletions; IRs: Inverted repeats; kbp: Kilobase pairs; LSC: Large single-copy; sativus L.). Gene. 2014;551(1):39–48. NGS: Next-generation sequencing; PEP: Plastid-encoded RNA polymerase; 9. Wu J, Liu B, Cheng F, Ramchiary N, Choi SR, Lim YP, Wang XW. Sequencing RuBisCO: Ribulose-(1,5)-bisphosphate carboxylase/oxygenase; sib: Sibling; of chloroplast genome using whole cellular DNA and solexa sequencing SNPs: Single nucleotide polymorphisms; SSC: Small single-copy technology. Front Plant Sci. 2012;3:243. 10. Lin SH, Liao YC. CISA: contig integrator for sequence assembly of bacterial Acknowledgments genomes. PLoS One. 2013;8(3):e60843. We acknowledge Yongqiang Zhu and Shengyue Wang at the Chinese 11. Morris J, Na YJ, Zhu H, Lee JH, Giang H, Ulyanova AV, Baltuch GH, Brem S, National Human Genome Center at Shanghai for assistance in interpreting Chen HI, Kung DK, et al. Pervasive within-mitochondrion single-nucleotide Illumina sequencing data. variant Heteroplasmy as revealed by single-mitochondrion sequencing. Cell Rep. 2017;21(10):2706–13. Funding 12. Tang J, Xia H, Cao M, Zhang X, Zeng W, Hu S, Tong W, Wang J, Wang J, Yu This work was supported by grants from the Natural Science Foundation of J, et al. A comparison of rice chloroplast genomes. Plant Physiol. 2004; JiangSu Province (BK20150879), the ‘Twelfth Five-Year’ National Science and 135(1):412–20. Technology Support Program (2012BAD01B03), the China Postdoctoral Sci- 13. Yamane K, Yano K, Kawahara T. Pattern and rate of indel evolution inferred ence Foundation Grant (2014 T70530) and a project funded by the Priority from whole chloroplast intergenic regions in sugarcane, maize and rice. Academic Program Development of Jiangsu Higher Education Institutions. DNA Res. 2006;13(5):197–204. The funding bodies did not have a role in the design of the study, data col- 14. Zhang YJ, Ma PF, Li DZ. High-throughput sequencing of six bamboo lection, analysis, interpretation of data, writing the manuscript, nor the deci- chloroplast genomes: phylogenetic implications for temperate woody sion to publish. bamboos (Poaceae: Bambusoideae). PLoS One. 2011;6(5):e20596. Zhu et al. BMC Genomics (2018) 19:411 Page 15 of 15 15. Serino G, Maliga P. RNA polymerase subunits encoded by the plastid rpo 41. Li H. A statistical framework for SNP calling, mutation discovery, association genes are not shared with the nucleus-encoded plastid enzyme. Plant mapping and population genetical parameter estimation from sequencing Physiol. 1998;117(4):1165–70. data. Bioinformatics. 2011;27(21):2987–93. 16. Legen J, Kemp S, Krause K, Profanter B, Herrmann RG, Maier RM. 42. Narasimhan V, Danecek P, Scally A, Xue Y, Tyler-Smith C, Durbin R. BCFtools/ Comparative analysis of plastid transcription profiles of entire plastid RoH: a hidden Markov model approach for detecting autozygosity from chromosomes from tobacco attributed to wild-type and PEP-deficient next-generation sequencing data. Bioinformatics. 2016;32(11):1749–51. transcription machineries. Plant J. 2002;31(2):171–88. 17. Redwan RM, Saidin A, Kumar SV. Complete chloroplast genome sequence of MD-2 pineapple and its comparative analysis among nine other plants from the subclass Commelinidae. BMC Plant Biol. 2015;15:196. 18. Miclotte G, Heydari M, Demeester P, Rombauts S, Van de Peer Y, Audenaert P, Fostier J. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol Biol. 2016;11:10. 19. Chen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of GC bias in next- generation-sequencing data on de novo genome assembly. PLoS One. 2013;8(4):e62856. 20. Ono Y, Asai K, Hamada M. PBSIM: PacBio reads simulator–toward accurate genome assembly. Bioinformatics. 2013;29(1):119–21. 21. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of ion torrent, pacific biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341. 22. Keightley PD, Ness RW, Halligan DL, Haddrill PR. Estimation of the spontaneous mutation rate per nucleotide site in a Drosophila melanogaster full-sib family. Genetics. 2014;196(1):313–20. 23. Li H, Durbin R. Fast and accurate short read alignment with burrows- wheeler transform. Bioinformatics. 2009;25(14):1754–60. 24. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. 25. Xu H, Luo X, Qian J, Pang X, Song J, Qian G, Chen J, Chen S. FastUniq: a fast de novo duplicates removal tool for paired short reads. PLoS One. 2012; 7(12):e52249. 26. Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2014;30(1):31–37. 27. Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms Mol Biol. 2013;8(1):22. 28. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5. 29. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9. 30. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18. 31. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23. 32. Stanciu C, Bennett JR. Alginate-antacid in the reduction of gastro- oesophageal reflux. Lancet. 1974;1(7848):109–11. 33. Simpson JT. Exploring genome characteristics and sequence quality without a reference. Bioinformatics. 2014;30(9):1228–35. 34. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8. 35. Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A, Lapidus A, Prjibelski AD, Pyshkin A, Sirotkin A, Sirotkin Y, et al. Assembling single-cell genomes and mini-metagenomes from chimeric MDA products. J Comput Biol. 2013;20(10):714–37. 36. Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92. 37. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. 38. Wyman SK, Jansen RK, Boore JL. Automatic annotation of organellar genomes with DOGMA. Bioinformatics. 2004;20(17):3252–5. 39. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25(5): 955–64. 40. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12.
– Springer Journals
Published: May 29, 2018