Background: The advent of –omics technologies has enabled the resolution of fine molecular differences among individuals within a species. DNA sequence variations, such as single nucleotide polymorphisms or small deletions, can be tabulated for many kinds of genotype comparisons. However, experimental designs and analytical approaches are replete with ways to overestimate the level of variation present within a given sample. Analytical pipelines that do not apply proper thresholds nor assess reproducibility among samples are susceptible to calling false-positive variants. Furthermore, issues with sample genotype identity or failing to account for heterogeneity in reference genotypes may lead to misinterpretations of standing variants as polymorphisms derived de novo. Results: A recent publication that featured the analysis of RNA-sequencing data in three transgenic soybean event series appeared to overestimate the number of sequence variants identified in plants that were exposed to a tissue culture based transformation process. We reanalyzed these data with a stringent set of criteria and demonstrate three different factors that lead to variant overestimation, including issues related to the genetic identity of the background genotype, unaccounted genetic heterogeneity in the reference genome, and insufficient bioinformatics filtering. Conclusions: This study serves as a cautionary tale to users of genomic and transcriptomic data that wish to assess the molecular variation attributable to tissue culture and transformation processes. Moreover, accounting for the factors that lead to sequence variant overestimation is equally applicable to samples derived from other germplasm sources, including chemical or irradiation mutagenesis and genome engineering (e.g., CRISPR) processes. Keywords: Soybean, Transgenic, Bioinformatics, Heterogeneity Background the intended trait encoded by the transgene, and con- The process of genetic transformation typically involves firmation that the transgenic plant does not have unin- inserting DNA sequences originating from one species tended consequences that may be detrimental to the into the genome of another species. This tool has been environment or to the consumer . Adverse effects are used to add traits into crop species, such as herbicide generally characterized in two categories: effects from tolerance in soybean and root worm tolerance in corn the transgene itself, and effects that arise from mutations [1–4]. The commercialization of transgenic products is resulting from gene insertion or the tissue culture subject to tight regulation, as transgenic strains must process. As a result, safety testing ensures that unin- undergo intense safety testing before being brought to tended DNA-level changes are not present in commer- market . The testing phase involves confirmation of cialized products [7, 8]. With the recent revolution in high-throughput * Correspondence: firstname.lastname@example.org sequencing technology, there is now increased interest Bioinformatics and Computational Biology Program, University of Minnesota, in understanding the molecular nature of transgenic Minneapolis, MN, USA events, and identifying possible safety implications of Department of Agronomy and Plant Genetics, University of Minnesota, 1991 Upper Buford Circle, 411 Borlaug Hall, Saint Paul, MN 55108, USA © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Michno and Stupar BMC Biotechnology (2018) 18:38 Page 2 of 9 unintended molecular changes that may result. This in- awareness of genomic heterogeneity within cultivars, formation may be useful in assessing the likelihood that and leveraging bioinformatics filters and replicated data a particular event will express the intended trait(s) with- as a way to minimize false positives. out detrimental unintended effects. Molecular studies have previously characterized the ef- Results and discussion fects of transgenesis in several different plant species, fo- Primary source of variation in transgenic event series 764: cusing on the sequence changes at transgene integration Incorrectly identified genetic background sites [9, 10] and/or the sequence changes genome-wide Lambirth et al. [21, 22] performed RNA-seq analyses of [11–19]. While no clear consensus has emerged, studies 27 transgenic plants, including nine individuals each se- utilizing sequence-level resolution have reported a range lected from three different transgenic series known as of possible sequence changes in transgenic plants, includ- ST77, ST111, and 764. They reported that all three of ing frequent observations (e.g., small deletions occuring these transgenic series were developed in the genetic adjacent to the integration site) and less frequent occur- background of cultivar ‘Williams 82’. As a control, they ances (e.g., translocations between chromosomes). also performed RNA-seq on nine individuals of A curious discrepancy in genome-wide sequence poly- ‘Williams 82’, thus resulting in a total of 36 RNA-seq morphisms has been observed in recent resequencing samples in the study. As ‘Williams 82’ was also the geno- studies of transgenic soybean. One study, published by type used to develop the soybean reference genome , our group , resequenced two independent transgenic all of the mutations reported by  were identified sim- T1 plants, and respectivley found only two and 18 single ply by comparing their transcriptome sequence to the nucleotide polymorphisms (SNPs) genome-wide (along reference genome. The authors reported surprisingly with deletions adjacent to the integrated transgene, as high mutation frequencies in both the transgenic and has been previously observed in other plant transform- control plants, particularly the 764 transgenic event ation studies). In contrast, Lambirth et al. [21, 22] series. As de novo mutations caused by the tissue cul- reported high rates of molecular variation among trans- ture or transgenesis pathway are expected to be unique genic soybean plants, both in terms of transcriptomic to a given event, the authors calculated the number of changes and DNA sequence changes. The authors ana- unique event-specific mutations in each series compared lyzed RNA-sequencing (RNA-seq) data on families from to the other groups/series in the study (i.e., the number three different transgenic events and reported thousands of mutations in one series that is not shared by the other of sequence variants per plant, focusing on SNPs and two series of transformants or the control ‘Williams 82’ small insertion-deletion (indel) variants. They reported plants). They reported a unique polymorphic SNP count tens of thousands of sequence variants in these plants, of 981 in event ST77, 927 in event ST111, and 7717 in including approximately 1000 to 7700 variants that were event 764. This discrepancy matched their earlier ana- unique to each of the three event series. This contrast lysis of gene expression variation among three series, between studies is even more surprising considering that where series 764 exhibited much greater expression vari- Anderson et al.  searched genome-wide while ation as compared to controls than did the other two Lambirth et al.  searched only the transcribed por- transgenic groups . tion of the genome. Both groups were studying the same Two findings in the Lambirth et al.  mutation ana- species (soybean) transformed by similar methods lysis stand out: (1) The SNP frequencies were much (Agrobacterium-mediated transformation of cotyledon- higher than other similar studies of soybean and ary nodes)  and resequenced using similar chemis- model plant species [11–19], particularly considering that tries (Illumina short-read). only the transcribed portion of the genome was being ana- Given the importance and real-world relevance of this lyzed; (2) Even with the generally high mutation rates re- topic, it is imperative to resolve the discrepancy between ported, the 764 series is still an outlier. To cross-validate the Anderson et al.  and Lambirth et al. [21, 22] the findings of this analysis, we downloaded and reana- studies. We are not aware of any transgenic resequen- lyzed the raw RNA-seq data from these studies. cing studies that have reported mutations rates similar Using the GATK Best Practices workflow [25, 26], we to those published by Lambirth et al. . Therefore, re-generated polymorphic SNP lists from all 36 samples the current study focuses on a reanalysis of the Lambirth of RNA-seq data used [21, 22]. As stated above, de novo et al.  dataset, applying a more stringent analytical SNPs generated by tissue culture or transformation pipeline. The outcome of this reanalysis demonstrates would be expected to be unique to each respective trans- that the Lambirth et al. [21, 22] studies overestimated genic event. Therefore, we focused our analysis on SNPs the transcriptional and DNA sequence variation in the that were unique to only one of the four groups (e.g., transgenic plants. These findings provide insight into the SNPs observed as an alternative base in one transgenic importance of identity preservation of genotypes, series, while matching the reference genome sequence in Michno and Stupar BMC Biotechnology (2018) 18:38 Page 3 of 9 the other two transgenic series and the ‘Williams 82’ ~ 4.9 Mb and ~ 5.9 Mb on chromosome 15. It is likely controls). Given that the transgenic plants were that this interval on chromosome 15 represents a region self-pollinated for several generations after transform- of genetic heterogeneity between the individual of ation, the SNPs derived from the tissue culture or trans- ‘Thorne’ used for transformation in the development of formation process are expected to be predominantly the 764 event and the individual(s) of ‘Thorne’ sampled homozygous. Therefore, we filtered our initial lists for for the USDA genotyping effort . While the series homozygous SNPs that are uniquely polymorphic 764 profile was a 99.2% match to ‘Thorne’ across the relative to the reference genome, compared to the 525 SNPs, the next closest match was ‘Washita’ (PI other transgenic lines and ‘Williams 82’ controls 618809) , which was only a 74.2% match. Both (Additional file 1: Figure S1). This analysis and filtering ‘Williams’ and ‘Williams 82’ had a 0% match rate to the pipeline differed from the Lambirth et al. pipelinein 525 SNPs in the 764 series (Fig. 1), as would be expected at least four critical ways: (1) The GATK Best Practices because the reference genome is based on ‘Williams 82’ workflow imposed a higher standard for calling variants and these SNPs were initially identified as polymorphic (see Methods section); (2) we did not include heterozy- between the 764 series and the reference genome. gous calls; (3) we did not include heterogeneous SNPs The clear conclusion from this analysis is that series among the nine samples of any group (the three trans- 764 was developed in ‘Thorne,’ rather than ‘Williams 82’. genic series or controls); (4) we required at least six out of ‘Thorne’ is commonly used for soybean transformation the nine samples within each group to exhibit the same (e.g., ). It is clear that the high polymorphism rate homozygous base call. reported in event series 764 is not an unintended conse- The analysis and filtering pipeline described above was quence of tissue culture or transgenesis. Instead, the ma- designed to prevent false-positive SNP calls. Nevertheless, jority (if not all) of the variation reported in this line is the pipeline was able to detect nearly 10,000 SNPs among simply standing variation that exists between ‘Thorne’ the transgenic samples (Table 1, Additional file 2:Table and ‘Williams 82’. This statement can be applied to all S1). However, the distribution of SNPs among the geno- previous reports of variation observed between these types was substantially different than what was reported plants, including gene transcription , mutations , previously . Almost all of the unique SNPs that we or any other characteristic. identified were found in transgenic series 764 (9738 out of the 9884 SNPs). Meanwhile, only 143 and 3 SNPs, re- Source of variation in transgenic event series ST77: Genetic spectively, were identified in ST77 and ST111 (Table 1). heterogeneity between different individuals of ‘William 82’ We postulated that the discrepancy exhibited by the The relatively lower polymorphism rates found in the re- 764 series might have resulted from experimental error analysis of S77 and S111 compared to that of 764 rather than biological factors. To test this, we compared (Table 1) indicated that these groups are likely derived the list of SNPs we generated (Table 1, Additional file 2: from the ‘Williams 82’ background. However, standing Table S1) with a list of pre-ascertained SNPs that were variation can persist within soybean cultivars , as the previously used to genotype the entire USDA soybean breeding process typically involves bulk harvesting of germplasm collection . We found that 525 of the breeding populations prior to full fixation of homozygos- SNPs that were unique to series 764 also matched the ity through single seed descent. Therefore, most soybean genome positions on the pre-ascertained SNP list cultivars are expected to exhibit slight differences from (Table 1, Additional file 2: Table S1). We compared the plant to plant [31, 32], as heterogeneous sub-lines fix SNP profile of these 525 SNPs for series 764 with all of different haplotypes within relatively small (but some- the accessions in the USDA collection. One genotype, times large) genomic intervals. For example, previous cultivar ‘Thorne’ (PI 564718) , was a nearly perfect genotyping of four different ‘Williams 82’ sub-lines re- match to series 764 (521 of the 525 SNPs match; Fig. 1). vealed specific regions of genomic variation on chromo- The four SNPs that did not match between series 764 somes 3, 7, 15 and 20 . and ‘Thorne’ were clustered together between positions It is relatively intuitive to identify genomic heterogen- eity between sub-lines of a cultivar, as sub-lines will show nearly complete homogeny throughout the gen- Table 1 Number of SNPs identified as unique for each transgenic line based on reanalysis of the RNA-Seq dataset ome, interrupted by specific regions with (sometimes dense) clusters of polymorphisms. We investigated 764 ST77 ST111 Williams 82 whether the 143 SNPs identified in our reanalysis of “Unique” SNPs found in 9738 143 3 0 the whole RNA-Seq dataset group ST77 could be explained by this type of standing heterogeneity between the ‘Williams 82’ controls used in “Unique” SNPs found in 525 11 0 0 the RNA-Seq dataset that the study and the ‘Williams 82’ individual that was used overlap with 50 k SNP positions for the original ST77 transformation event [21, 22]. Michno and Stupar BMC Biotechnology (2018) 18:38 Page 4 of 9 Indeed, 140 of the 143 SNPs and all 16 indels were clus- necessary to focus on the factors that inflated the overall tered at a single locus between positions 1.4 Mb and higher number of SNPs and indels discovered by their 2.2 Mb on chromosome 15 (Fig. 2). This cluster overlaps bioinformatic pipeline. While we would expect the au- with a previously reported region of heterogeneity in thors to identify polymorphisms due to the reasons out- ‘Williams 82’ . These results suggest that these vari- lined in the previous sections (e.g., the ‘Thorne’ ants are not associated with transgenesis, but represent background of series 764 and the genetic heterogeneity natural standing heterogeneity between the ‘Williams 82’ between ST77 and the control ‘Williams 82’ plants), the plant used to generate the ST77 transformation event reported polymorphism counts were unexpectedly high. and the ‘Williams 82’ individuals used as controls by For example, the plants in the 764 series averaged Lambirth et al. . 38,188 SNPs and 2390 indels per plant. This number will Therefore, after filtering for genotype identity and be higher than the other two transgenic series because it background heterogeneity, we found three SNPs each in is the ‘Thorne’ genetic background. However, the ST77 S77 and S111 that could not be explained by these fac- series averaged 21,666 SNPs and 1829 indels, and the tors. Follow-up analysis of S77 revealed one SNP within ST111 series averaged 20,208 SNPs and 1750 indels. Fur- an intron, one synonymous SNP within an exon, and thermore, the untransformed ‘William 82’ control plants one non-synonymous SNP within an exon (M to V exhibited counts of 20,707 SNPs and 1863 indels. There- amino acid change in the sixth exon of Gly- fore, this section is devoted to addressing the sources of ma.10G150500). Analysis of S111 revealed two SNPs these high estimates. within introns, and one non-synonymous SNP within an We retrieved the variant calls for each of the 36 sam- exon (T to G amino acid change in the fourth exon of ples used in their analysis (http://de.iplantcollaborative. Glyma.04G134800). org/dl/d/533570A3-1EFB-4864-B9A9-9D82F17E09A8/sn peffgenes.zip). Initial analyses of genotype calls revealed Source of variation in all transgenic series: Bioinformatics that there was a higher number of heterozygous variants handling and threshold parameters than homozygous variants for the alternate allele com- The previous two sections addressed our reanalysis of pared to the reference genome. ST77 and ST111 were RNA-seq data [21, 22], focusing on the subset of unique respectively advanced to the T8 and T4 generation be- SNPs and indels within any one transgenic series. fore sequencing. We can estimate the expected propor- However, the majority of the analysis reported, discussed tion of heterozygous variants in these generations if we and interpreted in the Lambirth et al.  paper (includ- assume the following: all of the mutations induced by ing the base substitution profile, the predicted effect of transgenesis were heterozygous in the T0 generation, the each polymorphism, and gene ontology enrichment ana- variants are not subject to segregation distortion, and lysis) used the original full set of SNPs and indels identi- the variants have negligible effects on organismal fitness. fied, rather than the “unique” subset. Hence it is Under these assumptions, we would expect Fig. 1 Reanalysis of series 764 reveals that its genetic background comes from genotype ‘Thorne’ rather than genotype ‘Williams 82’. 525 SNPs were identified that met two criteria: (1) they were consistently polymorphic between series 764 plants and the ‘Williams 82’ reference genome in the RNA-seq dataset; (2) they were previously genotyped across the USDA germplasm . A comparison of these SNPs to the all of the accessions in the USDA soybean accessions revealed ‘Thorne’ as a near-perfect match (99.2% identity), with a substantial gap to the next closest match (Washita at 74.2%). The reanalysis also confirmed that this panel of SNPs is completely polymorphic between the 764 series and ‘Williams 82’ (0% match) Michno and Stupar BMC Biotechnology (2018) 18:38 Page 5 of 9 Fig. 2 Distribution of unique SNPs across transgenic series ST77. The distribution of the 143 unique SNPs identified in ST77 is shown among the 20 chromosomes. Almost all of the ST77 SNPs (140 out of 143) cluster at a single locus on chromosome 15, which is a typical signature of genetic heterogeneity among the ‘Williams 82’ parental lines used in this study. The clustering of SNPs at specific, rather than random, positions is indicative of heterogenous standing variation that has previously documented in the ‘Williams 82’ cultivar  approximately 0.39% of the ST77 variants to be hetero- Figure S2). A quality score is represented on a log-based zygous at the T8 generation, and 6.25% of the ST111 Phred scale where, for example, a quality score of 10 in- variants to be heterozygous at the T4 generation. How- dicates that there is a 10% chance of the variant being ever, the retrieved data showed that 50.21 and 48.62% of incorrect and a quality score of 20 indicates that there is the variants were called as heterozygous for ST77 and a 1% chance of the variant being incorrect. Further in- ST111, respectively. The proportion of heterozygous vestigation into the authors’ methods revealed that the variants were far higher than what was expected, and variant calls lacked any type of depth or quality filter. were most likely false positives resulting from the ana- This further reinforces the likelihood that a large portion lysis method. of these variants at low depth and quality are most likely We further investigated whether the authors filtered false positives. their variants for read depth and/or quality. Although The experiments in these studies [21, 22] included the read depth alone is not sufficient to determine whether sequencing of nine samples per transgenic series (or the a variant is real, calls based on low read depth are more ‘Williams 82’ controls), consisting of three sibling seeds likely to be false positives than calls based on higher taken from three plants each. As mutations induced by read depths. False positives can arise from reads that transformation or tissue culture would presumably occur map poorly to the genome, or bases that are of low qual- in the T0 generation, one would expect the vast majority ity at the site of a polymorphism. When analyzing the of these loci to be fixed as homozygotes by the T4-T8 depth of variant calls for all 36 samples in the study, generations. Therefore, it may be intuitive to exclude 43.2% of variants were called at a depth of one read, and any variants that were not observed in all three siblings. 20.2% of variants were called with a depth of two reads While the authors reported on average ~ 20,000 SNPs (Fig. 3). Similarly, when analyzing the distribution of and ~ 1800 indels per individual plant for ST77, ST111, quality scores across all 36 samples, 55.3% of variant WT, and ~ 40,000 SNP’s and ~ 2400 indels per individual calls had a quality score of 10 or lower (Additional file 1: plant for 764 compared to the reference genome, the Michno and Stupar BMC Biotechnology (2018) 18:38 Page 6 of 9 Fig. 3 Depth of sequence coverage for all polymorphic variants (SNPs and indels) called in the Lambirth et al.  study. The polymorphic calls shown here were made between each sample and the reference genome ‘Williams 82’, without consideration for the uniqueness of the call among series or reproducibility among different plants within the series. Homozygous calls are shown in blue and heterozygous calls are shown in red. Each bar sums the number of polymorphisms across the nine plants that were called at each read depth (e.g., we are showing the ~ 211,448 total variants called in series ST77 across the nine plants; ST77 averaged 23,494 variants per plant). Note the relatively larger peak in the 21+ category for the 764 series compared to the other series; many of these (mostly homozygous) calls likely represent standing variants between lines ‘Thorne’ and ‘Williams 82’. The 21+ peaks in the other three groups (ST77, ST111, and ‘Williams 82’ controls) may derive from various factors, most obviously the clusters of variants that are found within heterogeneous regions of different sub-lines of ‘Williams 82’ majority of variants were detected as polymorphic in 2-bp indels (Additional file 2: Table S2) are likely a con- only one of the 36 samples in the study. Figure 4a shows sequence of the read mapping software and bioinfor- a comparison of the variants from three selected ST77 matics pipeline used . plants, each derived from a different T individual. In this case, over 20,000 variants were called for each plant, Conclusions but only 2807 of the variants were common across all In the present study, we re-examined an existing data three plants (Fig. 4a). Similar findings were observed for set that was previously used to report high mutation the ST77 “D” series siblings (all derived from a T plant counts from three transgenic plant series. We identified designated as “D”), in which a relatively small proportion three major factors that inflated the estimates of mo- (4356 out of 64,636) of the variants were in common to lecular variation in the transgenic plants from these all three siblings (Fig. 4b). These trends were observed studies. These factors included residual heterogeneity, across all sibling groups in the study (Additional file 1: genotype misidentification, and insufficient data filtering. Figure S3). Series 764 exhibited a greater proportion of The issue of genotype identity is obvious and intuitive, variants shared among the siblings, which would be ex- but requires caution, both for those handling and main- pected for a plant from a different genetic background taining the materials (e.g., seeds, tissue, DNA) and those than ‘Williams 82,’ i.e., these plants have more “true” se- handling the computational analysis. Errors in genotype quence variants that can be faithfully detected among identity can be diagnosed using strictly molecular ap- the different siblings. proaches, but situations where the identity of the Another indication of the high frequency of false pos- material has been compromised or misinterpreted itives called in the Lambirth et al.  study relates to can be problematic (see commentaries [34, 35]). The the structure of the indels that were called as poly- issue of genetic heterogeneity within lines and seed morphic. Of the 70,486 indels that were called, 52.9% stocks can create more subtle complications in analysis, of them were heterozygous and 59.6% of them had a as has been documented in the soybean line ‘Williams read depth of 3 or less. Interestingly, all of the indels 82’ . When properly accounted for, heterogenetity reported in the study exhibited polymorphisms that does not disrupt accurate analysis and interpretation. were either 1 bp insertions (22,809 calls), 2 bp inser- However, when not properly accounted for, this issue tions (8480 calls), 1 bp deletions (13,427 calls) or 2 bp may be problematic in assessing genomic, transcrip- deletions (25,770 calls). The high number of only 1- or tomic, and other types of variation. Within-line genetic Michno and Stupar BMC Biotechnology (2018) 18:38 Page 7 of 9 germplasm sources. This is particularly true for experi- ments on materials within the realm of biotechnology, as the findings may be used to inform regulatory agen- cies about the intended and unintended consequences of using these technologies. Evaluation for the presence of unintended changes at the DNA level continues to be a part of the safety evaluation for transgenic plants, and whole-genome sequencing has been proposed as a tool for this purpose . However, technical issues may make this problematic in crop species, which have com- plex, highly variable, and often heavily duplicated ge- nomes. Furthermore, as demonstrated by the present study, the analysis and interpretation of whole-genome sequencing data may be inconsistent among research groups. While Lambirth et al.  reported high rates of mutation in transgenic soybean lines, our reanalysis of their data concluded that there are relatively few se- quence variants detected in these lines that might be at- tributed to the transformation process. It will be difficult to standardize a regulatory methodology that accounts for every complication that will arise across research groups and species (e.g., standing genetic heterogeneity within a parental seed stock) that may be incorrectly at- tributed to the genetic transformation process. Methods Fig. 4 Number of overlapping polymorphisms in the Lambirth et al. Variant and indel detection  study. a Venn diagram showing of the number of sequence RNA-seq from  was downloaded from the National variants alternate to the reference genome that overlapped between Center for Biotechnology Information Sequence Read three T8 individuals derived from transgenic event ST77. Heterozygous and homozygous alternate calls are not differentiated in this analysis. b Archive using project number PRJNA271477 and reana- A similar Venn diagram of the number of polymorphism that overlapped lyzed as described below. Sequencing adapters and between the T7:8 siblings in the ST77D family low-quality bases were removed using Cutadapt with minimum read length set to 40 and quality cutoff set to heterogeneity can be an issue in many species, particu- 20 . Using the GATK Best Practices workflow for larly those in which a reference genome is presumed to RNA-seq [25, 26], reads were aligned to assembly version be perfectly representative of every individual in the seed two of the reference genome (Wm82.a2) from www.soy- stock. Lastly, data handling can be a major source of base.org using the STAR aligner . Read-group identifi- variation leading to inflated variant calls. Informatics cations were added and duplicate reads were marked pipelines generate large data sets, and users should be using Picard tools. Reads were then split into exon seg- aware of quality control measures, and commonly used ments, overhanging intronic segments were hard clipped, filtering parameters. Furthermore, experimental designs and mapping qualities were reassigned using the SplitNCi- that provide replicated samples or comparisons among garRead tool from the GATK Genome Analysis Toolkit near-isogenic materials (e.g., the sibling lines discussed with -RMQF set to 255 -RMQT set to 60 and enabling the in this study) can be used to further differentiate the -U ALLOW_N_CIGAR_READS flag . SNPs and indels high-confidence and low-confidence variant calls. were called using GATK HaplotypeCaller with the -don- While the present reanalysis focused specifically on tUseSoftClippedBases flag and -stand_call_conf set to 20. comparisons between transgenic lines, all the factors ad- The resulting VCF file was then split into separate files for dressed in this paper need also be considered when con- SNPs (Additional file 3) and indels (Additional file 4)and ducting any type of expression and/or genomic then filtered using VariantFiltrations from the Genome comparisons. This includes studies that focus on the Analysis Toolkit with parameters set to window of 35, effects of mutagenesis, on-target and off-target effects of cluster of 3, filter parameters of FS > 30, and QD < 2.0 for genome engineering technologies, assessments of stand- SNPs. Similar parameters were used for indel filtration, ing/natural variation, or other comparisons of except FS filter was set to > 200 for all 36 samples. Michno and Stupar BMC Biotechnology (2018) 18:38 Page 8 of 9 Variants that passed filtration were then used for down- Additional file 4: Resulting raw variant indel calls from GATK stream analysis. HaplotypeCaller. (VCF 29397 kb) Abbreviations Accession identification CRISPR: Clustered Regularly Interspaced Short Palindromic Repeats; Genotype calls from the filtered SNP list were extracted Indel: insertion-deletion; RNA-seq: RNA-sequencing; SNP: Single nucleotide using a custom python script then loaded into R statis- polymorphisms tical software. The dataset was filtered for homozygous Acknowledgements SNPs that are uniquely polymorphic to the reference The authors are grateful to Drs. Wayne Parrott, Tom Clemente, and Candice compared to the other transgenic lines and ‘Williams 82’ Hirsch for helpful suggestions and comments on this manuscript and Peter controls. SNPs were removed from the analysis if there Morrell and Fernanda Rodriguez for fruitful discussions. The authors are appreciative of the University of Minnesota’s Office of Information Technology was more than 33% missing data for a given line and if for providing data storage and the Minnesota Supercomputing Institute for there was no consensus genotype call between plants other computational needs. and replicates (Additional file 1: Figure S1). The result- ing SNPs were used to identify positions that overlapped Funding This work was supported, in part, by the United States Department of within the SoySNP50k iSelect BeadChip  VCF file Agriculture (Biotechnology Risk Assessment Project #2015–33522-24096). using the Wm82.a2 coordinates downloaded from www.Soybase.org. SNP calls for each of the 20,087 acces- Availability of data and materials The datasets analysed in the current study were originally generated and sions in the 50 k dataset were compared to the SNP calls made available in previous publications (see references [21, 22]). The for the 764 series to identify the accession with the high- software versions, options, thresholds and workflow details used for est level of SNP identity. sequence variant detection in the current study can be found at https:// github.com/MeeshCompBio/The_Other_WPT_Study Analysis of data from previous studies Authors’ contributions The Lambirth et al.  supplementary data was down- JM and RMS designed the analysis, performed the analysis, and wrote the manuscript. Both authors read and approved the final manuscript. loaded from http://de.iplantcollaborative.org/dl/d/53357 0A3-1EFB-4864-B9A9-9D82F17E09A8/snpeffgenes.zip, and Ethics approval and consent to participate each of the 36 samples VCF files were parsed for depth, Not applicable. quality, and genotype information using a custom Competing interests python script. The authors declare that they have no competing interests. Software and figures Publisher’sNote Parallelization of commands was run using GNU parallel. Springer Nature remains neutral with regard to jurisdictional claims in published Data that was generated using R statistical software was maps and institutional affiliations. plotted using the ggplot2 package . The genome distri- Received: 5 January 2018 Accepted: 18 May 2018 bution of SNPs was created by using Phenogram . References Data availability 1. Estruch JJ, Carozzi NB, Desai N, Duck NB, Warren GW, Koziel MG. Transgenic Software versions, options, thresholds, workflow details plants: an emerging approach to pest control. Nat Biotechnol. 1997;15:137–41. and custom scripts can be found at https://github.com/ 2. Padgette SR, Kolacz KH, Delannay X, Re DB, LaVallee BJ, Tinius CN, Rhodes WK, Otero YI, Barry GF, Eichholtz DA, et al. Development, identification, and MeeshCompBio/The_Other_WPT_Study. characterization of a glyphosate-tolerant soybean line. Crop Sci. 1995;35:1451–61. 3. Shah DM, Horsch RB, Klee HJ, Kishore GM, Winter JA, Tumer NE, Hironaka CM, Sanders PR, Gasser CS, Aykent S, et al. Engineering herbicide tolerance Additional files in transgenic plants. Science. 1986;233:478–81. 4. Vaughn T, Cavato T, Brar G, Coombe T, DeGooyer T, Ford S, Groth M, Howe A, Additional file 1: Figure S1. Pipeline to identify the background Johnson S, Kolacz K, et al. A method of controlling corn rootworm feeding genotype of 764. Figure S2. Quality scores for all polymorphic variants using a protein expressed in transgenic maize. Crop Sci. 2005;45:931–8. (SNPs and indels) called in the Lambirth et al.  study. Figure S3. 5. James C, Krattiger AF. Global Review of the Field Testing and Number of overlapping polymorphisms in the Lambirth et al.  study Commercialization of Transgenic Plants, 1986 to 1995: The First Decade of within each of the 12 sibling families studied. (PPTX 1455 kb) Crop Biotechnology. Ithaca, NY: ISAAA; 1996. ISAAA Briefs 1 6. Kessler DA, Taylor MR, Maryanski JH, Flamm EL, Kahl LS. The safety of foods Additional file 2: Table S1. SNP calls resulting from the data filtering developed by biotechnology. Science. 1992;256:1747–9. pipeline shown in Additional file 1: Figure S1, excluding the accession 7. Glenn KC, Alsop B, Bell E, Goley M, Jenkinson J, Liu B, Martin C, Parrott W, identification steps. The SNPs correspond to the top row in Table 1. Souder C, Sparks O, et al. Bringing new plant varieties to market: plant Table S2. Indel calls resulting from the data filtering pipeline shown in breeding and selection practices advance beneficial characteristics while Additional file 1: Figure S1, excluding the accession identification steps. minimizing unintended changes. Crop Sci. 2017;57:2906–21. (XLSX 2307 kb) 8. Weber N, Halpin C, Hannah LC, Jez JM, Kough J, Parrott W. Editor's choice: Additional file 3: Resulting raw variant SNP calls from GATK crop genome plasticity and its relevance to food and feed safety of HaplotypeCaller. (VCF 49825 kb) genetically engineered breeding stacks. Plant Physiol. 2012;160:1842–53. Michno and Stupar BMC Biotechnology (2018) 18:38 Page 9 of 9 9. Clark KA, Krysan PJ. Chromosomal translocations are a common 30. Haun WJ, Hyten DL, Xu WW, Gerhardt DJ, Albert TJ, Richmond T, Jeddeloh phenomenon in Arabidopsis thaliana T-DNA insertion lines. Plant J. 2010;64: JA, Jia G, Springer NM, Vance CP, Stupar RM. The composition and origins 990–1001. of genomic variation among individuals of the soybean reference cultivar 10. Nacry P, Camilleri C, Courtial B, Caboche M, Bouchez D. Major chromosomal Williams 82. Plant Physiol. 2011;155:645–55. rearrangements induced by T-DNA transformation in Arabidopsis. Genetics. 31. Fasoula VA, Boerma HR. Divergent selection at ultra-low plant density for 1998;149:641–50. seed protein and oil content within soybean cultivars. Field Crops Res. 2005; 91:217–29. 11. Endo M, Kumagai M, Motoyama R, Sasaki-Yamagata H, Mori-Hosokawa S, 32. Fasoula VA, Boerma HR. Intra-cultivar variation for seed weight and other Hamada M, Kanamori H, Nagamura Y, Katayose Y, Itoh T, Toki S. Whole- agronomic traits within three elite soybean cultivars. Crop Sci. 2007;47:367–73. genome analysis of herbicide-tolerant mutant rice generated by 33. Sun Z, Bhagwate A, Prodduturi N, Yang P, Kocher JA. Indel detection from agrobacterium-mediated gene targeting. Plant Cell Physiol. 2015;56:116–25. RNA-seq data: tool evaluation and strategies for accurate detection of 12. Jiang C, Mithani A, Gan X, Belfield EJ, Klingler JP, Zhu JK, Ragoussis J, Mott R, actionable mutations. Brief Bioinform. 2017;18:973–83. Harberd NP. Regenerant Arabidopsis lineages display a distinct genome-wide 34. Bergelson J, Buckler ES, Ecker JR, Nordborg M, Weigel DA. Proposal spectrum of mutations conferring variant phenotypes. Curr Biol. 2011;21:1385–90. regarding best practices for validating the identity of genetic stocks and the 13. Kashima K, Mejima M, Kurokawa S, Kuroda M, Kiyono H, Yuki Y. Comparative effects of genetic variants. Plant Cell. 2016;28:606–9. whole-genome analyses of selection marker-free rice-based cholera toxin B- 35. Lareau CA, Clement K, Hsu JY, Pattanayak V, Joung JK, Aryee MJ, Pinello L. subunit vaccine lines and wild-type lines. BMC Genomics. 2015;16:48. Response to Unexpected mutations after CRISPR-Cas9 editing in vivo. Nat 14. Kawakatsu T, Kawahara Y, Itoh T, Takaiwa F. A whole-genome analysis of a Methods. 2018;15(4):238-9. transgenic rice seed-based edible vaccine against cedar pollen allergy. DNA 36. Pauwels K, De Keersmaecker SC, De Schrijver A, du Jardin P, Roosens NH, Res. 2013;20:623–31. Herman P. Next-generation sequencing as a tool for the molecular 15. Labra M, Vannini C, Grassi F, Bracale M, Balsemin M, Basso B, Sala F. characterisation and risk assessment of genetically modified plants: added Genomic stability in Arabidopsis thaliana transgenic plants obtained by value or not? Trends Food Sci Tech. 2015;45:319–26. floral dip. Theor Appl Genet. 2004;109:1512–8. 37. Martin M. Cutadapt removes adapter sequences from high-throughput 16. Miyao A, Nakagome M, Ohnuma T, Yamagata H, Kanamori H, Katayose Y, sequencing reads. EMBnetjournal. 2011;17:10–2. Takahashi A, Matsumoto T, Hirochika H. Molecular spectrum of somaclonal 38. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, variation in regenerated rice revealed by whole-genome sequencing. Plant Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Cell Physiol. 2012;53:256–64. Bioinformatics. 2013;29:15–21. 17. Sabot F, Picault N, El-Baidouri M, Llauro C, Chaparro C, Piegu B, Roulin A, 39. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Guiderdoni E, Delabastide M, McCombie R, Panaud O. Transpositional Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The genome landscape of the rice genome revealed by paired-end mapping of high- analysis toolkit: a MapReduce framework for analyzing next-generation DNA throughput re-sequencing data. Plant J. 2011;66:241–6. sequencing data. Genome Res. 2010;20:1297–303. 18. Schouten HJ, Vande Geest H, Papadimitriou S, Bemer M, Schaart JG, 40. Wickham H. ggplot2. Wiley Interdiscip Rev Comput Stat. 2011;3:180–5. Smulders MJ, Perez GS, Schijlen E. Re-sequencing transgenic plants revealed 41. Wolfe D, Dudek S, Ritchie MD, Pendergrass SA. Visualizing genomic rearrangements at T-DNA inserts, and integration of a short T-DNA information across chromosomes with PhenoGram. BioData Min. 2013;6:18. fragment, but no increase of small mutations elsewhere. Plant Cell Rep. 2017;36:493–504. 19. Zhang D, Wang Z, Wang N, Gao Y, Liu Y, Wu Y, Bai Y, Zhang Z, Lin X, Dong Y, et al. Tissue culture-induced heritable genomic variation in rice, and their phenotypic implications. PLoS One. 2014;9:e96879. 20. Anderson JE, Michno JM, Kono TJ, Stec AO, Campbell BW, Curtin SJ, Stupar RM. Genomic variation and DNA repair associated with soybean transgenesis: a comparison to cultivars and mutagenized plants. BMC Biotechnol. 2016;16:41. 21. Lambirth KC, Whaley AM, Blakley IC, Schlueter JA, Bost KL, Loraine AE, Piller KJ. A comparison of transgenic and wild type soybean seeds: analysis of transcriptome profiles using RNA-Seq. BMC Biotechnol. 2015;15:89. 22. Lambirth KC, Whaley AM, Schlueter JA, Piller KJ, Bost KL. Transcript polymorphism rates in soybean seed tissue are increased in a single transformant of Glycine max. Int J Plant Genomics. 2016;2016:1562041. 23. Paz MM, Martinez JC, Kalvig AB, Fonger TM, Wang K. Improved cotyledonary node method using an alternative explant derived from mature seed for efficient agrobacterium-mediated soybean transformation. Plant Cell Rep. 2006;25:206–13. 24. Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al. Genome sequence of the palaeopolyploid soybean. Nature. 2010;463:178–83. 25. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8. 26. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy- Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1–11.10.33. 27. Song Q, Hyten DL, Jia G, Quigley CV, Fickus EW, Nelson RL, Cregan PB. Fingerprinting soybean germplasm and its utility in genomic research. G3 (Bethesda). 2015;5:1999–2006. 28. McBlain BA, Fioritto RJ, St. Martin SK, AJ C-DB, Schmitthenner AF, Cooper RL, Martin RJ. Registration of ‘Thorne’ soybean. Crop Sci. 1993;33:1406. 29. Farno L, Keim KR, Edwards LH. Registration of ‘Washita’ soybean. Crop Sci. 2003;43:1125.
– Springer Journals
Published: Jun 1, 2018