Enrichment of genomic DNA for polymorphism detection in a non‐model highly polyploid crop plant

Introduction

The advent of second‐generation DNA sequencing methods has increased sequencing throughput to such an extent that re‐sequencing of some genomes via the whole‐genome shotgun (WGS) approach has become relatively trivial (e.g. Arabidopsis, where currently 240 strains have been sequenced as part of the 1001Genomes Project from a total of 1001 planned). However, for the discovery of DNA polymorphisms across populations and/or in regions of interest in large genomes, the WGS approach is currently too inefficient and costly. For species without an assembled genome sequence, where the exact target sequence is unknown, assembling reads from the region(s) of interest adds a further level of complexity.

Early methods developed for selecting DNA of interest involved screening libraries of cloned DNA using hybridisation‐based assays ( Grunstein and Hogness, 1975 ). This can be a laborious task, especially for libraries from large target genomes such as the human genome ( Bell , 1980 ), and particularly because screening of libraries is usually carried out using one or a few probes at a time. Another method in use well before the advent of next‐generation sequencing (NGS) is PCR, which has been an extremely successful approach for enrichment or capture of genomic targets ( Mullis and Faloona, 1987 ). PCR‐based approaches can be effective where a population is being screened to detect polymorphisms, and generally where the target regions are relatively short and relatively few in number (e.g. Bundock , 2009 ; Malory , 2011 ). However, a highly parallel PCR method, based on microdroplets and using specialised technology, has been used successfully to target a large number of regions ( Tewhey , 2009b ).
Circularising probe systems that capture target genomic DNA combine specific probe sequences with universal PCR sequences to enable the multiplex amplification of large numbers of captured fragments for NGS ( Dahl , 2005, 2007 ). These have evolved from the padlock probe system used for single nucleotide polymorphism (SNP) genotyping ( Hardenbol , 2003, 2005 ). A recent iteration called molecular inversion probes (MIPs) (reviewed in Mamanova , 2010 ) has been used to capture target genomic DNA, which can then be sequenced directly on an NGS instrument without additional library preparation. This approach has been demonstrated to capture 1 Mb of sequence using 13 000 MIPs ( Turner , 2009 ).

To overcome PCR limitations, hybridisation capture‐based techniques have been successfully applied to enrich for target sequences. The technique of ‘direct selection’ was used initially for enriching cDNA fractions using large genomic clones ( Lovett , 1991 ; Morgan , 1992 ) but was later developed for selecting genomic DNA using biotinylated BAC DNA ( Bashiardes , 2005 ). A recent approach has been the use of microarrays to capture targets through preferential hybridisation. The captured DNA is eluted from the microarray and sequenced ( Albert , 2007 ; Okou , 2007 ). A modification of this approach has been to carry out the hybridisation in solution using the microarray as a platform for oligonucleotide synthesis ( Gnirke , 2009 ; Tewhey , 2009a ). The human exome has been the target for enrichment strategies by some groups, and this large and dispersed subset of the human genome has been successfully targeted using microarray‐based capture ( Hodges , 2007 ). One goal for the capture and sequencing of human exomes has been to identify the causes of genetic disorders, which has been demonstrated in principle by Ng (2009) and then demonstrated for two Mendelian‐inherited disorders: Miller syndrome ( Ng , 2010b ) and Kabuki syndrome ( Ng , 2010a ).
An example of array‐based capture used successfully on a plant species is described by Fu (2010) where two sequential array hybridisations were carried out on maize. The first hybridisation was carried out with a repeat‐based array, and the second with the target array. This was designed to create a blocker‐free capture protocol because a large proportion of the maize genome is composed of repeat sequences. For many plant species of economic importance, a complete genome sequence is not yet available, so there is no reference genome for the design of capture probes. For those wishing to discover DNA sequence polymorphisms en masse , obtaining sufficient coverage of selected regions using WGS sequencing could be prohibitively expensive. An alternative solution is explored in this paper, using a genome sequence from a close relative both for the design of probe sequences for targeted enrichment and for the alignment of the resulting enriched reads. This approach greatly extends the number of species to which targeted enrichment strategies can be applied.

Results and discussion

Creation of sequencing libraries and mapping of sequencing reads

Two sugarcane genotypes were chosen for the SureSelect targeted enrichment procedure: IJ76‐514, a pure Saccharum officinarum clone originally sourced from Irian Jaya, and Q165, a commercial cultivar that is a backcrossed hybrid of S. officinarum with Saccharum spontaneum . These genotypes are parents of the PJ2 mapping population ( Aitken , 2005 ). A WGS library for Illumina Genome Analyser (GA) analysis was constructed for each genotype, with each library run on a single GA lane producing 76‐bp paired‐end sequence reads ( Table 1 ). The proven WGS libraries were then subjected to the Agilent SureSelect enrichment procedure, and each was again run on a single GA lane with 76‐bp paired‐end reads on an Illumina GAIIx slide.
After trimming and discarding low‐quality sequence and depleting those reads that matched to chloroplast, mitochondrial or known repeat sequences, there were between 26 and 41 million reads acquired for each genotype/library (1.65 to almost 3 gigabases of sequence) ( Table 1 ). A library derived from a Sorghum bicolor genotype (R931945‐2‐2) was enriched using the same probes for comparative purposes. All sugarcane sequences have been lodged at the Short Read Archive at the National Center for Biotechnology Information (NCBI, SRA ) with the SRA accession number SRA051387.1 .

Table 1. Summary of DNA sequencing of two sugarcane genotypes (Q165 and IJ76‐514) and a sorghum genotype

| Genotype/Library type | No. of reads obtained (millions) | Total length of sequences (Gb) | No. of reads after quality trimming (millions) | No. of reads after depletion of repeats (millions) | Total no. of bases after depletion (Gb) |
|---|---|---|---|---|---|
| IJ76‐514 WGS | 30.9 | 2.06 | 27.2 | 26.6 | 1.65 |
| Q165 WGS | 38.7 | 2.48 | 32.7 | 31.9 | 1.86 |
| IJ76‐514 enriched | 57.9 | 4.29 | 56.2 | 41.0 | 2.98 |
| Q165 enriched | 51.7 | 3.81 | 50.3 | 38.0 | 2.77 |
| Sorghum enriched | 66.7 | 3.79 | 62.2 | 46.9 | 2.59 |

WGS, whole‐genome shotgun. Enriched: library created using the Agilent SureSelect targeted enrichment process.

The WGS sequences from the two sugarcane genotypes were aligned to the sorghum genome with around 60% of WGS reads from both genotypes mapping to this genome and more than 30% mapping to genes using low stringency thresholds ( Table 2 ). This indicates the high degree of similarity between the sugarcane and sorghum genomes and adds weight to the plausibility of enriching the sugarcane genome based on probes designed to sorghum DNA sequence. The subsequent enrichment of these libraries using probes designed mainly to sorghum coding regions increased the proportion of reads that mapped to the sorghum genome, particularly the genes, where more than 50% of reads aligned at low stringency ( Table 2 ).
Under stringent alignment conditions, the percentage of reads mapping to the sorghum genome was twice as high for the enriched libraries compared with the WGS libraries, and for mapping to the genic regions, there was at least a threefold increase when the enriched library was compared with the corresponding WGS library for both genotypes ( Table 2 ). This was an initial indication that the enrichment process was successful in capturing the targeted genic regions.

Table 2. Mapping of reads to the sorghum genome sequence

| Genotype/Library type | Percentage of reads mapping to sorghum genome * | Percentage of reads mapping to sorghum genes * |
|---|---|---|
| IJ76‐514 WGS | 57 (16) | 31 (9) |
| Q165 WGS | 60 (17) | 34 (11) |
| IJ76‐514 enriched | 75 (35) | 53 (29) |
| Q165 enriched | 76 (39) | 58 (33) |
| Sorghum enriched | 98 (67) | 83 (54) |

WGS, whole‐genome shotgun. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* Percentages in parentheses are for reads that mapped uniquely under stringent mapping conditions of length = 0.9 and similarity = 0.9; otherwise, default parameters of mapping non‐unique matches, length = 0.5 and similarity = 0.8 were used.

The extent to which the method of enrichment has succeeded in capturing sequences with homology to the probe sequences can be partly judged by the enrichment factor. This was calculated empirically for the sugarcane genotypes as the ratio of the proportion of the reads from the SureSelect enriched library that mapped to a probe sequence to the proportion of reads from the corresponding WGS library that mapped to a probe sequence. Using reads after quality trimming, the enrichment factor for IJ76‐514 was 10.0‐fold and for Q165 was 11.3‐fold based on read mapping to all probes ( Table 3 ). To determine the success of the strategy to use probes based on sorghum gene sequences to capture sugarcane sequences, the enrichment factor was calculated for a set of these probes alone.
For IJ76‐514, the enrichment was ninefold, and for Q165 it was 12‐fold ( Table 3 ). Thirty‐four per cent of sorghum R931945 reads matched to the sorghum baits compared with around 25% for Q165 and 19% for IJ76‐514 enriched libraries ( Table 3 ). The enrichment of the sorghum sample could not be calculated empirically as there was no WGS sequence for comparison; however, the fraction of the sorghum genome covered by these sorghum probe sequences is 0.6%. As the length fraction parameter for read mapping used here was 0.5, 76‐bp reads will map if they overlap the 120mer probe sequences by 38 bp of sequence. This effectively extends the probe coverage by 76 bp (38 bp at each end) to 196 bp. This extends the coverage of the probes to 1% of the genome and the fold enrichment would be 34‐fold, with a theoretical maximum of 100‐fold enrichment. None of these calculations take into account the length of the inserts that have been captured, which had a mean length of 230 bp for IJ76‐514, 180 bp for Q165 and 280 bp for sorghum. In the case of IJ76‐514 and sorghum, the mean inserts would further extend the target region on either side of each probe and so the estimates for enrichment are conservative. For the probes designed to tile across sucrose metabolism genes and 242 gene fragments previously targeted using 454 sequencing ( Bundock , 2009 ), the enrichment for both IJ76‐514 and Q165 was 35‐fold ( Table 3 ). Clearly relative to the enriched sorghum library, the level of enrichment was greater in the sugarcane genotypes for these genes, likely due to higher homology. The level of enrichment was also greater than that found for the probes designed to sorghum sequences but also possibly assisted by the tiling strategy (see Experimental procedures ) and the fact that there were extra copies of these probes for hybridisation in the SureSelect Oligo capture library (see Experimental procedures ). 
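The enrichment arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the inputs are the rounded percentages from Table 3 (so the empirical factors differ slightly from those quoted in the text), and the probe "extension" follows the reasoning given for the 0.5 length‐fraction mapping parameter.

```python
def enrichment_factor(enriched_fraction, baseline_fraction):
    """Ratio of the on-target read fraction after enrichment to the baseline."""
    if baseline_fraction <= 0:
        raise ValueError("baseline_fraction must be positive")
    return enriched_fraction / baseline_fraction


def effective_target_fraction(probe_fraction, probe_len=120, read_len=76,
                              length_fraction=0.5):
    """Genome fraction within which a read can still map to a probe.

    A read maps if at least length_fraction of it overlaps the probe, so each
    probe is effectively extended by the remaining read length at both ends.
    """
    overhang = read_len - int(read_len * length_fraction)  # 38 bp for 76-bp reads
    return probe_fraction * (probe_len + 2 * overhang) / probe_len


# Empirical factor from rounded Table 3 percentages (Q165, all probes).
q165_all_probes = enrichment_factor(0.57, 0.051)

# Sorghum: probes cover 0.6% of the genome; 34% of enriched reads map to them.
sorghum_target = effective_target_fraction(0.006)           # close to 1%
sorghum_fold = enrichment_factor(0.34, sorghum_target)      # close to 34-fold
sorghum_max_fold = enrichment_factor(1.0, sorghum_target)   # close to 100-fold

print(round(q165_all_probes, 1), round(sorghum_target, 4),
      round(sorghum_fold, 1), round(sorghum_max_fold, 1))
```

Rounding the effective target fraction to 1%, as the text does, gives the quoted 34‐fold enrichment and the theoretical maximum of 100‐fold.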
Table 3. Reads mapping to 120‐bp target probe sequences used for SureSelect enrichment

| Genotype/Library type | Percentage of reads mapping to all probe sequences * | Percentage of reads mapping to 34 105 probes designed to sorghum genes only * | Percentage of reads mapping to 961 probes 2× tiled across sugarcane genes * | Percentage of reads mapping to 7263 SureSelect probes designed to sugarcane ESTs * |
|---|---|---|---|---|
| IJ76‐514 WGS | 5.1 | 2.2 | 0.7 | 2.3 |
| Q165 WGS | 5.1 | 2.1 | 0.6 | 2.4 |
| IJ76‐514 enriched | 50 (×10) | 19 (×8.9) | 23 (×33) | 12 (×5.0) |
| Q165 enriched | 57 (×11.2) | 25 (×11.9) | 21 (×35) | 14 (×5.8) |
| Sorghum enriched | 42 | 34 (×34) | 3.4 | 2.7 |

WGS, whole‐genome shotgun. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* Figures in parentheses are the enrichment factor obtained by dividing the enriched percentage by the corresponding WGS percentage, or for sorghum calculated from expected coverage.

Discovery of single nucleotide polymorphisms

The main aim of this project was to discover SNPs in a large number of sugarcane genes. In addition to reads covering probes, sequence from either side of the target probe is also expected to be captured during enrichment. As the majority of the SureSelect probes were based on the sorghum genome, this genome was chosen as the main reference for aligning the Illumina reads from all libraries for SNP detection. SNP discovery was carried out on the reads aligned to the ten chromosomes of the sorghum genome sequence after depleting reads that matched to chloroplast, mitochondrial or known repeat sequences. The main focus was to find SNPs between the homologous loci within each sugarcane genotype as these provide the potential for mapping ( Bundock , 2009 ). Large numbers of candidate SNPs were identified from the sequence alignments and, although fixed SNPs between sugarcane and sorghum were readily identified, interestingly these were in the minority ( Table 4 ).
Even with stringent parameters for SNP detection, an enormous number of putative SNPs within each sugarcane genotype were detected, particularly from the enriched libraries with more than a quarter of a million SNPs detected as polymorphisms within each sugarcane genotype ( Table 4 ). For the enriched libraries, the SNPs were found in reads mapping to a large number of sorghum genes (almost 13 000 for IJ76‐514 and nearly 16 000 for Q165). This compares to a much smaller number of genes with SNPs for the WGS libraries (around 400), indicating that the enrichment procedure has worked to enrich for reads from the targeted genes ( Table 4 ). When the reads from the Q165 and IJ76‐514 enriched libraries were combined and mapped to the sorghum genome, the number of SNPs detected increased to 446 329 for SNPs within sugarcane, with 230 692 of these in predicted genes. There were in addition 232 270 fixed SNPs between sugarcane reads and the sorghum reference.

Table 4. Putative SNPs detected for each of the two parental sugarcane genotypes for the two different libraries (WGS versus enriched) after reads were aligned to the Sorghum bicolor genome sequence

| Genotype/Library type | Library size (millions of reads) | Total no. of putative SNPs * | No. of SNPs within sugarcane reads only | Sugarcane SNPs within annotated genes | Genes with one or more SNP sites | No. of predicted amino acid changes |
|---|---|---|---|---|---|---|
| IJ76‐514 WGS | 30.9 | 59 417 | 50 947 | 3169 | 384 | 341 |
| Q165 WGS | 38.7 | 65 808 | 56 839 | 3276 | 392 | 334 |
| IJ76‐514 enriched | 57.9 | 434 633 | 268 797 | 139 725 | 12 963 | 51 843 |
| Q165 enriched | 51.7 | 430 549 | 285 836 | 168 090 | 15 799 | 64 098 |

WGS, whole‐genome shotgun; SNPs, single nucleotide polymorphisms. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* SNP numbers are from SNP discovery analyses using minimum read coverage at a site of 20 and minimum 5% frequency of alternative allele.
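The two thresholds quoted in the Table 4 footnote (a minimum read coverage of 20 at a site and a minimum 5% frequency of the alternative allele) can be expressed as a simple per‐site filter. The Python sketch below is a hypothetical illustration of that logic, not the SNP‐calling software used in the study. It assumes the allele counts at each aligned position are already available, and it treats the second most frequent base as the alternative allele.

```python
def is_putative_snp(allele_counts, min_coverage=20, min_alt_frequency=0.05):
    """Return True if a site passes the coverage and allele-frequency thresholds.

    allele_counts maps each base observed at the site to its read count.
    """
    coverage = sum(allele_counts.values())
    if coverage < min_coverage or len(allele_counts) < 2:
        return False
    # Assumption: the second most frequent base is the alternative allele.
    alt_count = sorted(allele_counts.values(), reverse=True)[1]
    return alt_count / coverage >= min_alt_frequency


# Hypothetical sites: one well-covered polymorphism, one below the coverage
# threshold, and one where the second allele is too rare.
sites = {
    "site_a": {"A": 360, "C": 34},
    "site_b": {"A": 15, "C": 4},
    "site_c": {"G": 197, "T": 3},
}
passing = [name for name, counts in sites.items() if is_putative_snp(counts)]
print(passing)
```

Lowering `min_coverage` to 4 and `min_alt_frequency` to 0.01 gives the less stringent settings mentioned for the read‐subsampling analysis.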
To ensure that the large differences in the number of SNPs detected between WGS and enriched libraries were not due mainly to differences in the number of reads obtained from each library, and to compare SNP numbers equitably between genotypes, 25 million reads were randomly selected from each library and used for SNP detection ( Table 5 ). This clearly showed that there was a large increase in the number of SNPs (4–5.5‐fold) detectable in the reads from the enriched libraries compared to WGS libraries, particularly in genic regions (36–60‐fold). As this could not be attributed to differences in the number of reads in this case, it is likely to be due to the targeting of these reads as a result of the enrichment process. Although a larger number of SNPs was detected in IJ76‐514 than in Q165 from WGS (15% more), the Q165 enriched library provided a larger number of SNPs than the IJ76‐514 enriched library (18% more). About 61% of SNPs detected from the Q165 enriched library occurred within annotated genes compared with 54.6% for the IJ76‐514 enriched library. For the WGS libraries, the corresponding proportions are 5.5% and 6%, respectively. The larger number of SNPs detected in IJ76‐514 compared with Q165 from the same number of WGS reads ( Table 5 ) may not necessarily be due to a higher frequency of SNPs within this genotype. A difference in genome size may contribute so that fewer SNPs cross the sequence depth threshold for Q165. The larger number of SNPs in Q165 from enrichment may also not be due to greater SNP frequency but may be largely due to a higher proportion of capture of targets owing to greater similarity between designed probes and the Q165 genome, possibly due to the S. spontaneum DNA represented in this genome being more similar to sorghum sequence. Evidence for this interpretation comes from the fact that a higher proportion of reads from Q165 (WGS) map to the sorghum genome ( Table 2 ).
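The fold increases and genic proportions quoted above follow directly from the Table 5 counts. The short Python sketch below recomputes them as a check. The dictionary layout and the function name `summarise` are illustrative choices and are not part of the original analysis.

```python
# Putative SNP counts from 25 million randomly selected reads (Table 5):
# (SNPs within the genotype, SNPs in annotated genes)
table5 = {
    ("IJ76-514", "WGS"): (46616, 2825),
    ("Q165", "WGS"): (40582, 2246),
    ("IJ76-514", "enriched"): (188340, 102892),
    ("Q165", "enriched"): (222666, 136042),
}


def summarise(genotype):
    """Fold increases (enriched / WGS) and genic shares for one genotype."""
    wgs_total, wgs_genic = table5[(genotype, "WGS")]
    enr_total, enr_genic = table5[(genotype, "enriched")]
    return {
        "fold_all": enr_total / wgs_total,
        "fold_genic": enr_genic / wgs_genic,
        "genic_share_wgs": wgs_genic / wgs_total,
        "genic_share_enriched": enr_genic / enr_total,
    }


for genotype in ("IJ76-514", "Q165"):
    stats = summarise(genotype)
    print(genotype, {name: round(value, 3) for name, value in stats.items()})
```

The 4–5.5‐fold and 36–60‐fold ranges in the text correspond to `fold_all` and `fold_genic` for the two genotypes.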
Table 5. Putative SNPs detected from 25 million randomly selected reads for each of the two parental sugarcane genotypes for the two different libraries (WGS versus enriched)

| Genotype/Library type | Putative SNPs within sugarcane genotype only * | Putative SNPs in annotated genes |
|---|---|---|
| IJ76‐514 WGS | 46 616 | 2825 |
| Q165 WGS | 40 582 | 2246 |
| IJ76‐514 enriched | 188 340 | 102 892 |
| Q165 enriched | 222 666 | 136 042 |

WGS, whole‐genome shotgun; SNPs, single nucleotide polymorphisms. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* SNP numbers are from stringent SNP discovery analyses using minimum read coverage (20) with minimum of 5% frequency of alternative allele.

A random selection of reads, for subsets of different sizes, was sampled from both the WGS and enriched Q165 libraries and used for SNP analysis. A plot of the number of SNPs detected versus the number of reads in the subset indicated the different trends for the two libraries ( Figure 1 ). The trend line for the enriched library indicates that the number of SNPs detected is approaching saturation from the single lane of sequence, whilst the trend line for the WGS library is still climbing steeply. This indicates that for the SNPs detected from the enriched library, which constituted SNPs mainly within the targeted genic region, most had been detected, whilst for the WGS library, SNPs from across the whole genome are sampled randomly and constitute a much larger total pool of SNPs that would require many lanes of sequence before saturation is observed.

Figure 1. Plot of the number of SNPs discovered within the DNA sequence reads from the sugarcane genotype Q165. Reads were selected randomly from the total set in subsets of various sizes and aligned to the sorghum genome for SNP discovery. SNP numbers are from SNP discovery analyses using low minimum read coverage (4) with minimum of 1% frequency of alternative allele. (a) Number of SNPs discovered in reads from the SureSelect enriched library for Q165.
The trend line indicates that the curve is flattening out and reaching saturation. (b) SNPs discovered from the whole‐genome shotgun (WGS) reads from Q165. The trend line is concave up, indicating that as expected the genome is far from saturated with reads. Most WGS SNPs are not located in target regions.

An analysis was carried out to determine the extent to which earlier 454 sequencing of PCR products and cross‐genotype SNP discovery could contribute to SNP validation. As part of the SureSelect probe design, a set of probes was tiled across short products (approximately 200–400 bp) that had been sequenced from the same two genotypes using 454 technology ( Bundock , 2009 ). Illumina SureSelect reads, and separately 454 reads, from both sugarcane genotypes were aligned to 205 of these sequences (derived from 242 which had been formed into contig consensus sequences). For Q165, 1906 SNPs were detected within these regions using the same stringent SNP detection parameters used for the discovery of SNPs from reads aligned to the sorghum genome. Of these, 1323 SNPs (69.4%) were discovered as sequence variants within Q165 from the 454 reads. A further 332 of the 1906 Q165 SNPs were discovered within IJ76‐514 Illumina reads, and a further seven SNPs could be found in IJ76‐514 454 reads. So of the 1906 SNPs found in Q165 SureSelect sequence, there was independent sequence validation for 87% (1653). For IJ76‐514, 1428 SNPs were discovered from aligning the Illumina SureSelect reads to the 205 consensus sequences, with 1298 of these SNPs confirmed as present in other sequence (mostly 454 sequence from IJ76‐514), providing validation for 91%. These high validation percentages for these subsets, 87% for Q165 and 91% for IJ76‐514, should be applicable to, and give a high degree of confidence in, the SNPs discovered from alignment of SureSelect reads to sorghum genes generally.
A cross‐genotype validation, for all the SNPs discovered in genes from the Illumina SureSelect reads, was carried out for both genotypes. From the Q165 enriched library, of the 168 090 SNPs found to be within sorghum‐defined genes ( Table 4 ), 81 368 (48.4%) were also identified as SNPs in the IJ76‐514 enriched library. These shared SNPs constitute almost 50% of the Q165 SNPs in genes and almost 60% of IJ76‐514 SNPs in genes. There are a total of 226 447 different SNP sites found in genes when results from separate genotype analyses are combined. Again this cross‐genotype validation provides confidence in the SNPs discovered. Some proportion of these SNPs may represent sequence differences between paralogous loci and not be allelic. One way to remove most of the SNPs that are due to sampling two paralogs is to set a threshold of more than 55% sequence representation for the major allele (for bi‐allelic SNPs). This will remove those SNPs where there is almost equal representation from the two alleles which would occur where two paralogs have been sampled (assuming equal number of chromosome copies for each paralog). For Q165, of the 163 248 bi‐allelic SNPs in genes, 154 733 (approximately 95%) have a major allele with >55.0% frequency, and thus only a small proportion would be excluded using this suggested threshold. For all the bi‐allelic SNPs detected within predicted genes, the frequency of the minor allele was calculated, and a frequency histogram created for each genotype ( Figure 2 ). SNPs with a coverage of 50–400 were selected for this because at low coverage artifactual peaks arise owing to the large number of SNPs in this category and the small number of possible frequencies. At the high end of coverage (400 used arbitrarily here), SNPs are more likely to result from sequences that are repeated in the Saccharum genome but are unique in the S. bicolor genome and will tend to have low minor allele frequency. 
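Two pieces of arithmetic in the passage above lend themselves to a small Python sketch: the count of distinct genic SNP sites across the two genotypes (inclusion–exclusion on the shared sites) and the suggested major‐allele threshold for discarding likely paralogous SNPs. The function names are hypothetical, and the filter is a minimal illustration of the stated >55% rule and not the authors' implementation.

```python
def union_size(n_a, n_b, n_shared):
    """Number of distinct SNP sites across two genotypes (inclusion-exclusion)."""
    return n_a + n_b - n_shared


def passes_paralog_filter(major_count, minor_count, min_major_frequency=0.55):
    """Keep a bi-allelic SNP only if the major allele exceeds the threshold.

    Near-equal allele representation is what would be expected when reads
    from two paralogous loci are stacked on a single reference position.
    """
    total = major_count + minor_count
    if total == 0:
        return False
    return max(major_count, minor_count) / total > min_major_frequency


# Genic SNP counts from Table 4 and the shared count reported in the text.
q165_genic, ij_genic, shared = 168090, 139725, 81368
distinct_sites = union_size(q165_genic, ij_genic, shared)
shared_share_q165 = shared / q165_genic
shared_share_ij = shared / ij_genic

print(distinct_sites, round(shared_share_q165, 3), round(shared_share_ij, 3))
```

With these inputs the union reproduces the 226 447 distinct sites reported, and a SNP with, for example, 52 versus 48 supporting reads would be set aside as a possible paralog artefact.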
There is a higher proportion of SNPs for Q165 than for IJ76‐514 where the minor allele is at the rare end of the spectrum ( Figure 2 ), with more than 43% of SNPs having a minor allele frequency of 0.05–0.127 in Q165 compared with 36% in IJ76‐514. This higher proportion of rare SNPs was observed in an earlier study of SNPs in gene fragments in these same sugarcane genotypes ( Bundock , 2009 ). A possible explanation for this observation is that the S. spontaneum chromosomes of the Q165 genome would be likely to have many sequence differences from the S. officinarum chromosomes providing numerous SNPs on a minority of chromosomes. In addition, the number of chromosome copies is likely to be larger for Q165 than for IJ76‐514. This would lead to a larger proportion of SNPs with the minor allele being rare compared with IJ76‐514.

Figure 2. Histogram showing the percentage of SNPs with the corresponding minor allele frequency. Pertains to bi‐allelic SNPs found at a coverage threshold of 50–400 within genes from the (a) IJ76‐514 enriched library (b) Q165 enriched library.

The SureSelect approach has been highly successful with regard to the discovery of SNPs from the enriched reads aligning to genes. With an enrichment of 10–11‐fold, this method is economical, as the cost of library enrichment per library is much less than the cost required to produce the equivalent amount of sequence to cover the genic regions to an equivalent read depth. The method has allowed discovery of very large numbers of SNPs and should allow the mapping of a significant proportion of the genes in this complex genome. This strategy will enable high‐density SNP maps of the sugarcane genome to be constructed, which may be the key to successful assembly of a reference genome sequence for sugarcane ( Souza , 2011 ). As candidates for mapping, the most useful SNPs are single dose in one parent.
This occurs where one allele is represented on one chromosomal copy only and the alternative allele occurs on the other copies in the parent genome, with the second parent having only the major allele present on all chromosome copies. Ideally, the rare allele is absent from the other parent, and a single‐dose marker inherited from one parent can be developed, which can segregate 1 : 1 in the progeny and be fully informative for mapping (i.e. all progeny are informative). Selecting SNPs for assay development that have one allele at a low frequency is a prudent strategy for developing these single‐dose markers ( Figure 3 ). Ideally, the frequency would correspond to the reciprocal of the number of chromosomal copies, which for commercial sugarcane varieties is usually unknown. As the DNA sequences analysed here are publicly available, this now represents hundreds of thousands of sugarcane SNPs that are available for sugarcane researchers to use in mapping and association studies.

Figure 3. An alignment of reads from the SureSelect enriched library of Q165 to a region of sorghum chromosome 8 with SNPs highlighted. There is a C/A SNP at position 19 205 (see arrow) where the cytosine minor allele (C in blue) is at 8.6% frequency (34/360). This is a good candidate SNP for single‐dose marker development as the likely ratio of minor allele to major is 1:11, and thus, the SNP is likely to be present on one chromosome copy with 11 copies having the alternative allele. Two bases upstream at 19 207, there appears to be a difference between sugarcane (G in yellow) and the sorghum reference (A).

Design of probe sequences for hybridisation during capture

There were five different sources of probe sequence design for the baits used for capture during enrichment ( Table 6 ).
The most numerous were single probes designed to the coding region of predicted sorghum genes, with the next most numerous being probes designed to sugarcane expressed sequence tags (ESTs) that appeared not to be represented in the predicted sorghum genes. A third group of probes were designed to genomic sequence corresponding to selected fragments (202–445 bp in length) of 240 sugarcane genes included in a previous study. To compare the effectiveness of probe design, each probe sequence was used as a reference for read mapping. The GC content of probe designs was found to be highly significantly associated with the average read coverage for a probe. Probes with a GC content of around 30%–40% had the highest average read coverage ( Figure 4 ), a property that is probably directly related to DNA capture performance.

Table 6. Design of probes used for enrichment of sugarcane and sorghum genomic libraries

| Target sequence | Number of probes | Number of times represented on manufacturing array | Design strategy |
|---|---|---|---|
| Predicted sorghum coding sequences (28 008 in total) | 44 429 | 1 | Either a probe at both the 5′ and 3′ ends, or a probe at one end or mid sequence |
| Sugarcane ESTs | 7263 | 1 | 1 centrally located probe |
| Sucrose metabolism genes, genomic sequence, sugarcane | 757 | 4 | 2× tiling |
| Sucrose metabolism genes, genomic sequence, sorghum | 204 | 4 | 2× tiling |
| 240 sugarcane gene fragments from 454 genomic DNA sequence | 595 | 3 | 1× tiling |
| Total | 53 248 | | |

Figure 4. Graph of the GC content versus the average coverage of probes (baits) with reads mapped from the Q165 enriched library. The mean of the average coverage across the 120mer probes for each 10% band (25%–35%, 35%–45% etc.) has been calculated. Error bars are ± 1 SE of the mean. It can be observed that high GC content probes tend to have reduced coverage compared to probes with 25%–45% GC content which are optimal.

For the probes designed to the sorghum coding regions, there were three locations used for design: extreme 5′ end, middle and extreme 3′ end.
An analysis was undertaken to determine whether there was any effect of position on the ability of probes to capture sequences from hybridisation. However, it was found that probes designed at the start of a predicted coding region (5′ end) had a significantly higher GC content ( P < 0.0001, ANOVA) than those designed to the middle or the 3′ end of the coding region ( Figure 5 ). So a model that included probe position, percentage GC content of probe and the interaction was fitted for the sorghum probes to explain the average coverage for each probe and used to analyse all three enriched libraries. For the enriched sorghum library, only GC content was found to be a statistically significant influence on the capture efficiency, and the position of the probe within the coding DNA sequence (CDS) and the interaction were not significant. For both of the sugarcane genotypes, all three terms were significant and probes designed at the extreme 5′ end had higher average coverage when GC content was taken into account than probes at the extreme 3′ end. Based on a smaller number of probes designed to the middle of sorghum coding regions, there was no significant difference in performance to probes at the 3′ end for either sugarcane genotype.

Figure 5. For probes designed to the coding region of predicted sorghum genes (the majority of probes), the region to which the probe was designed was found to significantly influence the GC content of probes on average, with 5′ end sequences leading to probes with higher GC content on average (a). These probes were also less effective on average at capturing sequence as judged by the number of reads that mapped to the probe sequences after sequencing enriched libraries. The case for sorghum is presented (b) with a similar result found for both the sugarcane enriched libraries. Error bars are 95% confidence intervals.
This analysis was carried out after removing the 1% of probes with the highest coverage to reduce undue influence of very highly repeated regions. The likely explanation for these observations is that 5′ end probes tend to represent more conserved sequences than probes at the 3′ end, so that when there are sequence differences between probe and target, the capture of targets and/or read alignment during mapping is reduced. This is supported by the fact that an analysis of SNP distributions indicated that there was a very highly statistically significant difference between SNP abundance between the three locations with the 3′ end having a much higher number of SNPs than expected and the 5′ end and middle having a smaller number of SNPs for all three enriched libraries (chi‐square test, P << 0.0001). Even though probes at the 5′ end had a higher coverage when GC content was taken into account, the GC content of 5′ end probes tends to be too high for the design of optimal probes (and they also miss out on capturing as much variation as 3′ end probes). The best recommendation therefore is to design a 3′ end probe as a matter of course because it is likely to have a GC content closer to optimum and also likely to encompass regions with variability. However, as insurance, if possible, a 5′ end probe could be designed to allow for those situations where there is too much variability at the 3′ end and there may be sufficient capture to enable alignment and detection of sequence variation. The caveat would be to design both probes, if possible, to a location where the GC content is below 55%, to reduce the number of poorly performing probes. If WGS sequence of the target species has been obtained, it could be used profitably during the probe design stage to determine whether any potential probe sequences have captured repetitive elements by mapping WGS reads to prospective probe sequences and culling those with high or very high coverage as belonging to repetitive elements. 
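The probe‐design recommendation above (prefer a 3′ end probe, keep a 5′ end probe as insurance, and avoid candidates with GC content of 55% or more) can be sketched as a small Python filter. The function names, the position labels and the toy sequences are hypothetical. The 55% cut‐off is the value suggested in the text.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def select_probes(candidates, max_gc=0.55, preferred_order=("3prime", "5prime")):
    """Return candidate probes below max_gc, with the 3' end probe listed first.

    candidates maps a position label to a probe sequence.
    """
    kept = []
    for position in preferred_order:
        seq = candidates.get(position)
        if seq and gc_content(seq) < max_gc:
            kept.append((position, seq))
    return kept


# Toy 120-base candidates: a GC-rich 5' end probe and an AT-rich 3' end probe.
candidates = {"5prime": "GC" * 60, "3prime": "AT" * 40 + "GC" * 20}
selected = select_probes(candidates)
print([(position, round(gc_content(seq), 2)) for position, seq in selected])
```

In this toy case only the 3′ end probe survives, because the 5′ end candidate is entirely G and C. The additional step suggested in the text, culling probes that attract very high WGS read coverage, would need read‐mapping output and is not modelled here.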
As expected, the capture of sorghum sequences from the sorghum library with probes designed to the coding region of sorghum genes was more effective than the capture of sequences with probes designed to sugarcane ESTs and vice versa ( Figure 6 ). The same effect was observed for the probes tiled across sucrose metabolism genes: those designed to sorghum genes had a higher coverage in the sorghum enriched library than those designed to sugarcane and vice versa for the sugarcane libraries (data not shown). The probes designed to tile across genes or fragments had a higher average probe coverage than the non‐tiled probes although this effect is confounded with the additional number of times these probes were represented on the array for the manufacture of the baits.

Figure 6. The read coverage, after mapping reads to the probe sequences used for sequence capture for enrichment, is strongly influenced by the GC content of the probe sequences. After removing the effect of GC content of probes, there is a strong effect in the residuals of probe sequence origin (sorghum or sugarcane). For the enriched sorghum library, the read coverage is significantly higher for probes derived from sorghum coding DNA sequence (CDS) compared to sugarcane ESTs (a). For both sugarcane libraries, the opposite is the case: probes designed to sugarcane ESTs have higher coverage on average than those designed to sorghum CDS (b, the Q165 genotype). An almost identical graph to b was obtained for IJ76‐514. The error bars represent 95% confidence intervals for the means.

The sequencing of gene‐enriched genomic DNA complements other strategies that employ sequencing to discover genome variations. Amplicon sequencing has been shown to allow the analysis of variation in many loci in genotypes of interest ( Bundock , 2009 ; Kharabian‐Masouleh , 2011 ), providing efficient SNP discovery. 
This more targeted SNP discovery approach is limited to much smaller numbers of loci than possible with the method reported here. Transcriptome sequencing ( Winfield , 2010 ) is useful but will require analysis of many different tissues and developmental stages to achieve the level of genome coverage possible with enriched genome sequencing.

In conclusion, we have demonstrated the application of a solution hybridisation capture‐based method targeted to enrich genes in the sugarcane genome based mainly on gene sequences found in the recently sequenced sorghum genome. The use of related genomes for design of probes might allow this approach to be widely applied in polymorphism discovery in poorly described species for which a closely related reference genome is available.

Experimental procedures

Design of probes for enrichment

Predicted coding regions from S. bicolor gene models were sourced from Phytozome v4.0. Sugarcane tentative consensus sequences (TCs), representing sugarcane transcripts, were downloaded from the Sugarcane Gene Index 2.2. Additional sugarcane sequences were retrieved from the Representative Public identifiers corresponding to the probe sets on the Sugar Cane Whole Genome Array. The 120‐base sequences were extracted from each sequence using text functions in Microsoft Excel. Simple repeats and low complexity sequences within each 120‐base sequence were identified using RepeatMasker. Homology searching using the BLAST algorithm ( Altschul , 1990 ) was performed on local customised versions of the Sugarcane Gene Index 2.1, the Oryza sativa (rice) coding regions and the S. bicolor gene models, hosted at the CSIRO Bioinformatic Facility, Canberra, ACT, Australia. All 36 338 S. bicolor gene model coding regions (Sb_cds) were retrieved from Phytozome v4.0. The first 120 bases and last 120 bases (left_120mer and right_120mer, respectively) were extracted from each. 
Each 120mer was assessed for simple repeats and low complexity as well as for the presence of Ns in the sequence. 33 152 of the left_120mers and 35 165 of the right_120mers did not contain either Ns or low complexity sequences. Identical oligos from highly related Sb_cds were removed from both sets. This resulted in the retention of 31 770 left_120mers and 33 829 right_120mers, which represented a total of 34 965 Sb_cds. After filtration of the results of BLASTn homology against the Sugarcane Gene Index 2.2 and the Oryza sativa (rice) coding regions, it was determined that 7936 of the S. bicolor gene models were not represented by a suitable 120mer. These gene models were targeted by homology searching their sequences against the Sugarcane Gene Index. 1080 of the gene models had alignments in excess of 120 bases long. Matching 120mer oligonucleotides were designed, commencing at the start of the matching sequence, reducing to 981 after filtration to remove 120mers containing Ns (Table S1). Two additional approaches were used to maximise the number of individual useful probes designed. The first involved using the Representative Public identifiers corresponding to the probe sets on the Affymetrix Sugar Cane Whole Genome Array (depleted for control and chloroplast‐related probe sets) to identify 7328 unique sugarcane sequences for further examination. These were homology searched against the S. bicolor gene models that had not yet been represented by a 120mer. Of the 770 matches returned, once alignment length and duplicate matches had been taken into account, 284 additional 120mers were designed, commencing at the start of the match to S. bicolor sequence. The second approach targeted the 27 728 singletons and 7263 TCs that were not homologous to the 120mers designed so far. An attempt was made to design 120mers to each of the sequences, commencing at base 101 to avoid possible poorer‐quality sequence at the extreme 5′ end of each sequence. 
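The first steps of the design described above (taking the first and last 120 bases of each coding sequence, discarding oligonucleotides that contain Ns, and removing identical oligonucleotides from highly related gene models) can be sketched as follows. This is a hedged illustration only: the authors report using text functions in Microsoft Excel, so the function names and data structures here are assumptions, and the RepeatMasker low complexity screen and the BLAST steps are not reproduced.

```python
PROBE_LEN = 120  # probe length used in the paper


def end_oligos(cds: str, probe_len: int = PROBE_LEN):
    """Return (left, right) oligos from the two ends of a coding sequence,
    or None when the sequence is shorter than one probe."""
    cds = cds.upper()
    if len(cds) < probe_len:
        return None
    return cds[:probe_len], cds[-probe_len:]


def is_unambiguous(oligo: str) -> bool:
    """True when the oligo holds only A, C, G or T (no N or other codes)."""
    return set(oligo) <= set("ACGT")


def design_end_probes(cds_by_id, probe_len: int = PROBE_LEN):
    """Collect unique, unambiguous left and right oligos.

    cds_by_id: dict of gene model id -> coding sequence (hypothetical input).
    Returns two dicts (left, right) mapping oligo -> first gene model seen,
    so identical oligos from related gene models are kept only once.
    """
    left, right = {}, {}
    for gene_id, cds in cds_by_id.items():
        ends = end_oligos(cds, probe_len)
        if ends is None:
            continue
        for store, oligo in zip((left, right), ends):
            if is_unambiguous(oligo) and oligo not in store:
                store[oligo] = gene_id
    return left, right
```

Keeping the first gene model seen for a duplicated oligo is an arbitrary choice made for the sketch; the text does not say which copy was retained.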
Once the resulting oligonucleotides were filtered to remove sequences that were <120 bases long, 7134 new 120mers remained (Table S2). The number of 120‐base S. bicolor and sugarcane oligonucleotides designed and validated from all approaches described above was 51 847, reducing to 51 692 after a final check to detect oligonucleotides with inadvertent ambiguities.

A set of probes was also designed to tile across (two‐times tiling) eight genes for sucrose metabolism: the five genes for sucrose phosphate synthase I–V ( SPS I–V ), soluble acid invertase ( SAI ) and sucrose synthase 1 and 2 ( SuSy ). Two‐times tiling involves designing a set of probes that lie end to end and span the target sequence, with a second set of probes offset (overlapping the first set) by 50% of the length of a probe (60 bases in this case) also lying across the length of the target. The genomic DNA sequence of four sugarcane SPS genes ( SPS II–V ) was used from sequences obtained at Southern Cross University. The sequence used for the design of probes for SPS I was obtained from sorghum along with SuSy1 . A set of 240 gene fragments amplified in an earlier experiment and sequenced using 454 technology ( Bundock , 2009 ) were also included for probe design using a two‐times tiling approach (Table S3). Table 6 shows the number of probes designed for each category, and the number of times each probe was represented on the array for manufacture, to create the bait sequences for hybridisation.

Preparation and sequencing of sugarcane genomic DNA libraries

Whole‐genome shotgun libraries from the genomic DNA of Q165 and IJ76‐514 were prepared using the Illumina DNA paired‐end library prep kit. The Covaris S2 was used to shear the genomic DNA before adapter ligation. Size selection was carried out on an Invitrogen E‐gel 2% with a target size of around 320 bp. 
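The two‐times tiling scheme described under probe design above (one set of 120‐base probes laid end to end, plus a second set offset by 60 bases) is equivalent to stepping along the target in half‐probe increments. The sketch below shows that equivalence; the function name is hypothetical, and how the authors handled a target tail shorter than one probe is not stated, so the sketch simply stops at the last full‐length probe.

```python
def tile_2x(target: str, probe_len: int = 120):
    """Return (start, probe) pairs for two-times tiling of a target.

    Starts at 0, 120, 240, ... form the end-to-end set; starts at
    60, 180, 300, ... form the offset set. Stepping by half a probe
    length yields the union of both sets in positional order.
    """
    step = probe_len // 2
    probes = []
    for start in range(0, len(target) - probe_len + 1, step):
        probes.append((start, target[start:start + probe_len]))
    return probes
```

With this layout every base, apart from the first and last 60 bases of the tiled span, is covered by two probes.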
Analysis on an Agilent Bioanalyser indicated that the size range of the vast majority of fragments was from 290 to 410 bp for IJ76‐514, whilst for Q165 the size range was from 220 to 380 bp post‐PCR (i.e. including 120 bases of adapter sequence). Aliquots of WGS libraries for Q165, IJ76‐514 and Sorghum R931945‐2‐2 were used for the SureSelect procedure that was carried out according to the protocol from Agilent (SureSelect Target Enrichment System, Illumina Paired‐End Sequencing Platform Library Prep, Protocol Version 1.0, September 2009) and using the SureSelect reagents supplied by Agilent (Agilent Technologies Inc., Santa Clara, CA). Paired‐end sequencing was carried out using the Illumina GA IIx with 76‐base read length.

Analysis of Illumina sequence data

Illumina sequence data were analysed with CLC Genomics Workbench software, version 4 (CLC bio, Aarhus N, Denmark). Paired‐end read data with quality scores were quality trimmed using an error probability limit of 0.01, and sequences shorter than 30 bases or containing unknown bases were removed. For low stringency mapping of reads to reference sequences, the default parameters were used, which included length fraction = 0.5, similarity = 0.8 with mismatch cost = 2, insertion cost = 3, deletion cost = 3 and non‐specific matches handled by random assignment. High stringency read mapping was used for SNP detection and for comparative purposes with the length fraction = 0.9, similarity = 0.9 and non‐specific matches ignored (i.e. only unique matches mapped), otherwise the same as default parameters. For SNP detection, the trimmed sequences were depleted of reads matching repeated regions by mapping with default parameters to known repetitive regions. These were sorghum repeats (TIGR_Sorghum_Repeats.v3.0_0_0.fsa.txt), the sugarcane (SP80‐3280) chloroplast genome (NCBI Reference Sequence: NC_005878.2), the S. bicolor mitochondrial genome (GenBank: DQ984518.1) and an intron from our sugarcane sucrose phosphate synthase 3 ( SPS3 ) sequence that was found to be highly repetitive. Reads not matching to these repeat sequences were saved and mapped, using high stringency mapping parameters, to the ten chromosomes of the annotated S. bicolor genome sequence ( Paterson , 2009 ) that had previously been downloaded from NCBI and imported into CLC Workbench. SNP detection using CLC Bio Workbench was carried out on these mapped reads using the following stringent parameters: window length = 21, maximum gaps and mismatches = 2, minimum central base quality score = 30, minimum average quality score = 20, minimum coverage at SNP site = 20, minimum variant frequency = 5.0%, ploidy = 4, maximum coverage = 1000, sufficient variant count = 50 and required variant count = 2.

Acknowledgements

Genomic DNA samples of IJ76‐514 and Q165 were obtained from Dr Karen Aitken, CSIRO Plant Industry, Queensland Bioscience Precinct, St Lucia, Qld, Australia. Sorghum R931945‐2‐2 genomic DNA was supplied by Dr Emma Mace, Crop and Food Science, DEEDI Hermitage Research Facility, Warwick Qld.

Plant Biotechnology Journal Wiley




Publisher
Wiley
Copyright
"Copyright © 2012 Wiley Subscription Services, Inc., A Wiley Company"
eISSN
1467-7652
DOI
10.1111/j.1467-7652.2012.00707.x
pmid
22624722

Abstract

Introduction The advent of second‐generation DNA sequencing methods has increased sequencing throughput to such an extent that re‐sequencing of some genomes via the whole‐genome shotgun (WGS) approach has become relatively trivial (e.g. Arabidopsis where currently 240 strains have been sequenced as part of the 1001Genomes Project from a total of 1001 planned—). However, for the discovery of DNA polymorphisms across populations and/or in regions of interest in large genomes, the WGS approach is currently too inefficient and costly. For species without an assembled genome sequence where the exact target sequence is unknown, there is an additional level of complexity to assemble reads from the region(s) of interest. Early methods developed for selecting DNA of interest involved screening libraries of cloned DNA using hybridisation‐based assays ( Grunstein and Hogness, 1975 ). This can be a laborious task, especially for libraries from large target genomes such as the human genome ( Bell , 1980 ), and especially because screening of libraries is usually carried out using one or a few probes at a time. Another method in use well before the advent of next‐generation sequencing (NGS) is PCR, which has been an extremely successful approach for enrichment or capture of genomic targets ( Mullis and Faloona, 1987 ). PCR‐based approaches can be effective where a population is being screened to detect polymorphisms and generally where the size of target regions is relatively short and the number of regions relatively small in number (e.g. Bundock , 2009 ; Malory , 2011 ). However, a highly parallel PCR method, based on microdroplets and using specialised technology, has been used successfully to target a large number of regions ( Tewhey , 2009b ). 
Circularising probe systems that capture target genomic DNA combine specific probe sequences with universal PCR sequences to enable the multiplex amplification of large numbers of captured fragments for next‐generation sequencing (NGS) ( Dahl , 2005, 2007 ). These have evolved from the padlock probe system used for single nucleotide polymorphisms (SNP) genotyping ( Hardenbol , 2003, 2005 ). A recent iteration called molecular inversion probes (MIPs) (reviewed in Mamanova , 2010 ) has been used to capture target genomic DNA, which can then be sequenced directly (without additional library preparation) using a NGS instrument and has been demonstrated to capture 1 Mb of sequence using 13 000 MIPs ( Turner , 2009 ). To overcome PCR limitations, hybridisation capture‐based techniques have been successfully applied to enrich for target sequences. The technique of ‘direct selection’ was used initially for enriching cDNA fractions using large genomic clones ( Lovett , 1991 ; Morgan , 1992 ) but was later developed for selecting genomic DNA using biotinylated BAC DNA ( Bashiardes , 2005 ). A recent approach has been the use of microarrays to capture targets through preferential hybridisation. The captured DNA is eluted from the microarray and sequenced ( Albert , 2007 ; Okou , 2007 ). A modification of this approach has been to carry out the hybridisation in solution using the microarray as a platform for oligonucleotide synthesis ( Gnirke , 2009 ; Tewhey , 2009a ). The human exome has been the target for enrichment strategies by some groups, and this large and dispersed subset of the human genome has been successfully targeted using microarray‐based capture ( Hodges , 2007 ). One goal for the capture and sequencing of human exomes has been to identify the causes of genetic disorders, which has been demonstrated in principle by Ng (2009) and then demonstrated for two Mendelian‐inherited disorders—Millers syndrome ( Ng , 2010b ) and Kabuki syndrome ( Ng , 2010a ). 
An example of array‐based capture used successfully on a plant species is described by Fu (2010) where two sequential array hybridisations were carried out on maize. The first hybridisation was carried out with a repeat‐based array, and the second with the target array. This was designed to create a blocker‐free capture protocol because a large proportion of the maize genome is composed of repeat sequences. For many plant species of economic importance, a complete genome sequence is not yet available, so there is no reference genome for the design of capture probes. For those wishing to discover DNA sequence polymorphisms en masse , obtaining sufficient coverage of selected regions using the WGS approach could be prohibitively expensive. An alternative solution is explored in this paper, using a genome sequence from a close relative both for the design of probe sequences for targeted enrichment and for the alignment of the resulting enriched reads. This approach greatly extends the number of species to which targeted enrichment strategies can be applied.

Results and discussion

Creation of sequencing libraries and mapping of sequencing reads

Two sugarcane genotypes were chosen for the SureSelect targeted enrichment procedure: IJ76‐514, a pure Saccharum officinarum clone originally sourced from Irian Jaya, and Q165, a commercial cultivar that is a backcrossed hybrid of S. officinarum with Saccharum spontaneum . These genotypes are parents of the PJ2 mapping population ( Aitken , 2005 ). A WGS library for Illumina Genome Analyser (GA) analysis was constructed for each genotype with each library run on a single GA lane producing 76‐bp paired‐end sequence reads ( Table 1 ). The proven WGS libraries were then subjected to the Agilent SureSelect enrichment procedure and again each run on a single GA lane with 76‐bp paired‐end reads on an Illumina GAIIx slide. 
After trimming and discarding low‐quality sequence and depleting those reads that matched to chloroplast, mitochondrial or known repeat sequences, there were between 26 and 41 million reads acquired for each genotype/library (1.65 to almost 3 gigabases of sequence) ( Table 1 ). A library derived from a Sorghum bicolor genotype (R931945‐2‐2) was enriched using the same probes for comparative purposes. All sugarcane sequences have been lodged at the Short Read Archive at the National Center for Biotechnology Information (NCBI, SRA ) with the SRA accession number SRA051387.1 .

Table 1. Summary of DNA sequencing of two sugarcane genotypes (Q165 and IJ76‐514) and a sorghum genotype

Genotype/Library type | No. of reads obtained (millions) | Total length of sequences (Gb) | No. of reads after quality trimming (millions) | No. of reads after depletion of repeats (millions) | Total no. of bases after depletion (Gb)
IJ76‐514 WGS | 30.9 | 2.06 | 27.2 | 26.6 | 1.65
Q165 WGS | 38.7 | 2.48 | 32.7 | 31.9 | 1.86
IJ76‐514 enriched | 57.9 | 4.29 | 56.2 | 41.0 | 2.98
Q165 enriched | 51.7 | 3.81 | 50.3 | 38.0 | 2.77
Sorghum enriched | 66.7 | 3.79 | 62.2 | 46.9 | 2.59

WGS, whole‐genome shotgun. Enriched: library created using the Agilent SureSelect targeted enrichment process.

The WGS sequences from the two sugarcane genotypes were aligned to the sorghum genome with around 60% of WGS reads from both genotypes mapping to this genome and more than 30% mapping to genes using low stringency thresholds ( Table 2 ). This indicates the high degree of similarity between the sugarcane and sorghum genomes and adds weight to the plausibility of enriching the sugarcane genome based on probes designed to sorghum DNA sequence. The subsequent enrichment of these libraries using probes designed mainly to sorghum coding regions increased the proportion of reads that mapped to the sorghum genome, particularly the genes, where more than 50% of reads aligned at low stringency ( Table 2 ). 
Under stringent alignment conditions, the percentage of reads mapping to the sorghum genome was twice as high for the enriched libraries compared with the WGS libraries, and for mapping to the genic regions, there was at least a threefold increase when the enriched library was compared with the corresponding WGS library for both genotypes ( Table 2 ). This was an initial indication that the enrichment process was successful in capturing the targeted genic regions.

Table 2. Mapping of reads to the sorghum genome sequence

Genotype/Library type | Percentage of reads mapping to sorghum genome * | Percentage of reads mapping to sorghum genes
IJ76‐514 WGS | 57 (16) | 31 (9)
Q165 WGS | 60 (17) | 34 (11)
IJ76‐514 enriched | 75 (35) | 53 (29)
Q165 enriched | 76 (39) | 58 (33)
Sorghum enriched | 98 (67) | 83 (54)

WGS, whole‐genome shotgun. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* Percentages in parentheses are for reads that mapped uniquely under stringent mapping conditions of length = 0.9 and similarity = 0.9; otherwise, default parameters of mapping non‐unique matches, length = 0.5 and similarity = 0.8 were used.

The extent to which the method of enrichment has succeeded in capturing sequences with homology to the probe sequences can be partly judged by the enrichment factor. This was calculated empirically for the sugarcane genotypes as the ratio of the proportion of the reads from the SureSelect enriched library that mapped to a probe sequence to the proportion of reads from the corresponding WGS library that mapped to a probe sequence. Using reads after quality trimming, the enrichment factor for IJ76‐514 was 10.0‐fold and for Q165 was 11.3‐fold based on read mapping to all probes ( Table 3 ). To determine the success of the strategy to use probes based on sorghum gene sequences to capture sugarcane sequences, the enrichment factor was calculated for a set of these probes alone. 
For IJ76‐514, the enrichment was ninefold, and for Q165 it was 12‐fold ( Table 3 ). Thirty‐four per cent of sorghum R931945 reads matched to the sorghum baits compared with around 25% for Q165 and 19% for IJ76‐514 enriched libraries ( Table 3 ). The enrichment of the sorghum sample could not be calculated empirically as there was no WGS sequence for comparison; however, the fraction of the sorghum genome covered by these sorghum probe sequences is 0.6%. As the length fraction parameter for read mapping used here was 0.5, 76‐bp reads will map if they overlap the 120mer probe sequences by 38 bp of sequence. This effectively extends the probe coverage by 76 bp (38 bp at each end) to 196 bp. This extends the coverage of the probes to 1% of the genome and the fold enrichment would be 34‐fold, with a theoretical maximum of 100‐fold enrichment. None of these calculations take into account the length of the inserts that have been captured, which had a mean length of 230 bp for IJ76‐514, 180 bp for Q165 and 280 bp for sorghum. In the case of IJ76‐514 and sorghum, the mean inserts would further extend the target region on either side of each probe and so the estimates for enrichment are conservative. For the probes designed to tile across sucrose metabolism genes and 242 gene fragments previously targeted using 454 sequencing ( Bundock , 2009 ), the enrichment for both IJ76‐514 and Q165 was 35‐fold ( Table 3 ). Clearly relative to the enriched sorghum library, the level of enrichment was greater in the sugarcane genotypes for these genes, likely due to higher homology. The level of enrichment was also greater than that found for the probes designed to sorghum sequences but also possibly assisted by the tiling strategy (see Experimental procedures ) and the fact that there were extra copies of these probes for hybridisation in the SureSelect Oligo capture library (see Experimental procedures ). 
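The two enrichment calculations described above can be written out as a short worked example. The first function is the empirical ratio of mapped‐read proportions (enriched over WGS). The second reproduces the footprint argument for sorghum: with a length fraction of 0.5, a 76‐bp read may overhang a 120‐base probe by 38 bp on either side, so the effective target is 196 bp per probe. The function names are assumptions for illustration; the input percentages are the rounded values reported in Table 3, so the results agree with the quoted enrichment factors only approximately.

```python
def enrichment_factor(pct_enriched: float, pct_wgs: float) -> float:
    """Ratio of the share of reads mapping to probes in the enriched
    library to the share mapping to probes in the WGS library."""
    return pct_enriched / pct_wgs


def effective_target_fraction(probe_fraction: float, probe_len: int = 120,
                              read_len: int = 76,
                              length_fraction: float = 0.5) -> float:
    """Genome fraction in which a read can still map to a probe.

    A read maps if at least read_len * length_fraction bases overlap the
    probe, so it may extend read_len minus that overlap beyond each end.
    """
    min_overlap = read_len * length_fraction   # 38 bp in the paper
    overhang = read_len - min_overlap          # 38 bp beyond each probe end
    footprint = probe_len + 2 * overhang       # 196 bp
    return probe_fraction * footprint / probe_len


# Sugarcane, all probes, from rounded Table 3 percentages
ij76_all = enrichment_factor(50, 5.1)
q165_all = enrichment_factor(57, 5.1)

# Sorghum: probes cover 0.6% of the genome and 34% of reads match the baits
sorghum_target = effective_target_fraction(0.006)   # about 1% of the genome
sorghum_fold = 0.34 / sorghum_target                # about 34-fold
theoretical_max = 1.0 / sorghum_target              # about 100-fold
```

The text rounds the effective target to 1% of the genome, which gives the stated 34‐fold enrichment and 100‐fold theoretical maximum; the unrounded footprint gives values slightly above these.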
Table 3. Reads mapping to 120‐bp target probe sequences used for SureSelect enrichment

Genotype/Library type | Percentage of reads mapping to all probe sequences * | Percentage of reads mapping to 34 105 probes designed to sorghum genes only | Percentage of reads mapping to 961 probes 2× tiled across sugarcane genes | Percentage of reads mapping to 7263 SureSelect probes designed to sugarcane ESTs
IJ76‐514 WGS | 5.1 | 2.2 | 0.7 | 2.3
Q165 WGS | 5.1 | 2.1 | 0.6 | 2.4
IJ76‐514 enriched | 50 (×10) | 19 (×8.9) | 23 (×33) | 12 (×5.0)
Q165 enriched | 57 (×11.2) | 25 (×11.9) | 21 (×35) | 14 (×5.8)
Sorghum enriched | 42 | 34 (×34) | 3.4 | 2.7

WGS, whole‐genome shotgun. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* Figures in parentheses are the enrichment factor obtained by dividing the enriched percentage by the corresponding WGS percentage, or for sorghum calculated from expected coverage.

Discovery of single nucleotide polymorphisms

The main aim of this project was to discover SNPs in a large number of sugarcane genes. In addition to reads covering probes, sequence from either side of the target probe is also expected to be captured during enrichment. As the majority of the SureSelect probes were based on the sorghum genome, this genome was chosen as the main reference for aligning the Illumina reads from all libraries for SNP detection. SNP discovery was carried out on the reads aligned to the ten chromosomes of the sorghum genome sequence after depleting reads that matched to chloroplast, mitochondrial or known repeat sequences. The main focus was to find SNPs between the homologous loci within each sugarcane genotype as these provide the potential for mapping ( Bundock , 2009 ). Large numbers of candidate SNPs were identified from the sequence alignments and although fixed SNPs between sugarcane and sorghum were readily identified, interestingly these were in the minority ( Table 4 ). 
Even with stringent parameters for SNP detection, an enormous number of putative SNPs within each sugarcane genotype were detected, particularly from the enriched libraries with more than a quarter of a million SNPs detected as polymorphisms within each sugarcane genotype ( Table 4 ). For the enriched libraries, the SNPs were found in reads mapping to a large number of sorghum genes (almost 13 000 for IJ76‐514 and nearly 16 000 for Q165). This compares to a much smaller number of genes with SNPs for the WGS libraries (around 400), indicating that the enrichment procedure has worked to enrich for reads from the targeted genes ( Table 4 ). When the reads from the Q165 and IJ76‐514 enriched libraries were combined and mapped to the sorghum genome, the number of SNPs detected increased to 446 329 for SNPs within sugarcane, with 230 692 of these in predicted genes. There were in addition 232 270 fixed SNPs between sugarcane reads and the sorghum reference.

Table 4. Putative SNPs detected for each of the two parental sugarcane genotypes for the two different libraries (WGS versus enriched) after reads were aligned to the Sorghum bicolor genome sequence

Genotype/Library type | Library size (millions of reads) | Total no. of putative SNPs * | No. of SNPs within sugarcane reads only | Sugarcane SNPs within annotated genes | Genes with one or more SNP sites | No. of predicted amino acid changes
IJ76‐514 WGS | 30.9 | 59 417 | 50 947 | 3169 | 384 | 341
Q165 WGS | 38.7 | 65 808 | 56 839 | 3276 | 392 | 334
IJ76‐514 enriched | 57.9 | 434 633 | 268 797 | 139 725 | 12 963 | 51 843
Q165 enriched | 51.7 | 430 549 | 285 836 | 168 090 | 15 799 | 64 098

WGS, whole‐genome shotgun; SNPs, single nucleotide polymorphisms. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* SNP numbers are from SNP discovery analyses using minimum read coverage at a site of 20 and minimum 5% frequency of alternative allele. 
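The coverage and frequency thresholds quoted for SNP discovery, and the distinction drawn above between SNPs within sugarcane and fixed differences from the sorghum reference, can be illustrated with a per‐site classifier. This is a simplified sketch under stated assumptions: it works on plain allele counts at one reference position, it ignores the base quality and neighbourhood window criteria applied by the CLC software, and the function name and return labels are hypothetical.

```python
def classify_site(base_counts, ref_base, min_coverage=20, max_coverage=1000,
                  min_freq=0.05, min_count=2):
    """Classify one reference position from read allele counts.

    base_counts: dict of base -> number of reads showing that base
    Returns "within" when two or more alleles pass the thresholds
    (polymorphic within the sugarcane reads), "fixed" when a single
    non-reference allele passes, and None otherwise.
    """
    coverage = sum(base_counts.values())
    if coverage < min_coverage or coverage > max_coverage:
        return None
    passing = [base for base, n in base_counts.items()
               if n >= min_count and n / coverage >= min_freq]
    if len(passing) >= 2:
        return "within"
    if len(passing) == 1 and passing[0] != ref_base:
        return "fixed"
    return None
```

For example, a site with 30 reads of A and 10 of G against reference A would count as a SNP within sugarcane, while 40 reads of G against reference A would count as a fixed difference.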
To ensure that the large differences in the number of SNPs detected between WGS and enriched libraries was not due mainly to differences in the number of reads obtained from each library and also to equitably compare SNP numbers between genotypes, 25 million reads were randomly selected from each library and used for SNP detection ( Table 5 ). This clearly showed that there was a large increase in the number of SNPs (4–5.5‐fold) detectable in the reads from the enriched libraries compared to WGS libraries, particularly in genic regions (36–60‐fold). As this could not be attributed to differences in the number of reads in this case, it is likely to be due to the targeting of these reads as a result of the enrichment process. It could also be seen that although there were a larger number of SNPs detected in IJ76‐514 from WGS compared with Q165 (15% more), the Q165 enriched library provided a larger number of SNPs than the IJ76‐514 enriched library (18% more). About 61% of SNPs detected from the Q165 enriched library occurred within annotated genes compared with 54.6% for the IJ76‐514 enriched library. For the WGS libraries, the corresponding proportions are 5.5% and 6%, respectively. The larger number of SNPs detected in IJ76‐514 compared with Q165 from the same number of WGS reads ( Table 5 ) may not necessarily be due to a higher frequency of SNPs within this genotype. A difference in genome size may contribute so that fewer SNPs cross the sequence depth threshold for Q165. The larger number of SNPs in Q165 from enrichment may also not be due to greater SNP frequency but may be largely due to higher proportion of capture of targets owing to greater similarity between designed probes and the Q165 genome, possibly due to the S. spontaneum DNA represented in this genome being more similar to sorghum sequence. Evidence for this interpretation comes from the fact that a higher proportion of reads from Q165 (WGS) map to the sorghum genome ( Table 2 ). 
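The fold increases and genic proportions quoted in this paragraph follow directly from the counts in Table 5. The short calculation below recomputes them as a check; the dictionary layout and function names are assumptions for illustration, while the counts are taken from Table 5.

```python
# (putative SNPs within genotype, putative SNPs in annotated genes), Table 5
table5 = {
    ("IJ76-514", "WGS"): (46616, 2825),
    ("Q165", "WGS"): (40582, 2246),
    ("IJ76-514", "enriched"): (188340, 102892),
    ("Q165", "enriched"): (222666, 136042),
}


def genic_percentage(genotype: str, library: str) -> float:
    """Percentage of a library's SNPs that fall in annotated genes."""
    total, genic = table5[(genotype, library)]
    return 100.0 * genic / total


def fold_increase(genotype: str):
    """(all SNPs, genic SNPs) fold increase of enriched over WGS."""
    wgs_total, wgs_genic = table5[(genotype, "WGS")]
    enr_total, enr_genic = table5[(genotype, "enriched")]
    return enr_total / wgs_total, enr_genic / wgs_genic


for genotype in ("IJ76-514", "Q165"):
    total_fold, genic_fold = fold_increase(genotype)
    print(genotype, round(total_fold, 1), round(genic_fold, 1),
          round(genic_percentage(genotype, "enriched"), 1))
```

The ratios come out at roughly 4.0 and 5.5 for all SNPs and 36 and 61 for genic SNPs, consistent with the ranges given in the text.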
Table 5. Putative SNPs detected from 25 million randomly selected reads for each of the two parental sugarcane genotypes for the two different libraries (WGS versus enriched)

Genotype/Library type | Putative SNPs within sugarcane genotype only * | Putative SNPs in annotated genes
IJ76‐514 WGS | 46 616 | 2825
Q165 WGS | 40 582 | 2246
IJ76‐514 enriched | 188 340 | 102 892
Q165 enriched | 222 666 | 136 042

WGS, whole‐genome shotgun; SNPs, single nucleotide polymorphisms. Enriched: library created using the Agilent SureSelect targeted enrichment process.
* SNP numbers are from stringent SNP discovery analyses using minimum read coverage (20) with minimum of 5% frequency of alternative allele.

A random selection of reads, for subsets of different sizes, was sampled from both the WGS and enriched Q165 libraries and used for SNP analysis. A plot of the number of SNPs detected versus the number of reads in the subset indicated the different trends for the two libraries ( Figure 1 ). The trend line for the enriched library indicates that the number of SNPs detected is approaching saturation from the single lane of sequence, whilst the trend line for the WGS library is still climbing steeply. This indicates that for the SNPs detected from the enriched library, which constituted SNPs mainly within the targeted genic region, most had been detected, whilst for the WGS library, SNPs from across the whole genome are sampled randomly and constitute a much larger total pool of SNPs that would require many lanes of sequence before saturation is observed.

Figure 1. Plot of the number of SNPs discovered within the DNA sequence reads from the sugarcane genotype Q165. Reads were selected randomly from the total set in subsets of various sizes and aligned to the sorghum genome for SNP discovery. SNP numbers are from SNP discovery analyses using low minimum read coverage (4) with minimum of 1% frequency of alternative allele. (a) Number of SNPs discovered in reads from the SureSelect enriched library for Q165. The trend line indicates that the curve is flattening out and reaching saturation. (b) SNPs discovered from the whole‐genome shotgun (WGS) reads from Q165. The trend line is concave up, indicating that as expected the genome is far from saturated with reads. Most WGS SNPs are not located in target regions.

An analysis was carried out to determine the extent to which earlier 454 sequencing of PCR products and cross‐genotype SNP discovery could contribute to SNP validation. As part of the SureSelect probe design, a set of probes was tiled across short products (approximately 200–400 bp) that had been sequenced from the same two genotypes using 454 technology ( Bundock , 2009 ). Illumina SureSelect reads, and separately 454 reads, from both sugarcane genotypes were aligned to 205 of these sequences (derived from 242 which had been formed into contig consensus sequences). For Q165, 1906 SNPs were detected within these regions using the same stringent SNP detection parameters used for the discovery of SNPs from reads aligned to the sorghum genome. Of these SNPs, 1323 (69.4%) were also discovered as sequence variants within Q165 from the 454 reads. A further 332 of the 1906 Q165 SNPs were discovered within IJ76‐514 Illumina reads, and a further seven SNPs could be found in IJ76‐514 454 reads. So of the 1906 SNPs found in Q165 SureSelect sequence, there was independent sequence validation for 87% (1653). For IJ76‐514, 1428 SNPs were discovered from aligning the Illumina SureSelect reads to the 205 consensus sequences, with 1298 of these SNPs confirmed as present in other sequence (mostly 454 sequence from IJ76‐514), providing validation for 91%. These high validation percentages for these subsets, 87% for Q165 and 91% for IJ76‐514, should be applicable to, and give a high degree of confidence in, the SNPs discovered from alignment of SureSelect reads to sorghum genes generally. 
A cross‐genotype validation, for all the SNPs discovered in genes from the Illumina SureSelect reads, was carried out for both genotypes. From the Q165 enriched library, of the 168 090 SNPs found to be within sorghum‐defined genes ( Table 4 ), 81 368 (48.4%) were also identified as SNPs in the IJ76‐514 enriched library. These shared SNPs constitute almost 50% of the Q165 SNPs in genes and almost 60% of IJ76‐514 SNPs in genes. There are a total of 226 447 different SNP sites found in genes when results from separate genotype analyses are combined. Again this cross‐genotype validation provides confidence in the SNPs discovered. Some proportion of these SNPs may represent sequence differences between paralogous loci and not be allelic. One way to remove most of the SNPs that are due to sampling two paralogs is to set a threshold of more than 55% sequence representation for the major allele (for bi‐allelic SNPs). This will remove those SNPs where there is almost equal representation from the two alleles which would occur where two paralogs have been sampled (assuming equal number of chromosome copies for each paralog). For Q165, of the 163 248 bi‐allelic SNPs in genes, 154 733 (approximately 95%) have a major allele with >55.0% frequency, and thus only a small proportion would be excluded using this suggested threshold. For all the bi‐allelic SNPs detected within predicted genes, the frequency of the minor allele was calculated, and a frequency histogram created for each genotype ( Figure 2 ). SNPs with a coverage of 50–400 were selected for this because at low coverage artifactual peaks arise owing to the large number of SNPs in this category and the small number of possible frequencies. At the high end of coverage (400 used arbitrarily here), SNPs are more likely to result from sequences that are repeated in the Saccharum genome but are unique in the S. bicolor genome and will tend to have low minor allele frequency. 
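The coverage window (50–400) and the suggested paralog threshold (major allele above 55%) can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline used in the study, which ran in CLC Genomics Workbench. The `BiallelicSnp` class, the function names, the 5% bin width and the read counts are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class BiallelicSnp:
    """Read counts at one bi-allelic SNP site (hypothetical container)."""
    major_count: int  # reads carrying the more frequent allele
    minor_count: int  # reads carrying the less frequent allele

    @property
    def coverage(self) -> int:
        return self.major_count + self.minor_count


def in_coverage_window(snp: BiallelicSnp, low: int = 50, high: int = 400) -> bool:
    """Keep SNPs with a coverage of 50-400, the window used for the histogram."""
    return low <= snp.coverage <= high


def passes_paralog_filter(snp: BiallelicSnp, major_threshold: float = 0.55) -> bool:
    """Keep SNPs whose major allele makes up more than 55% of the reads."""
    return snp.major_count / snp.coverage > major_threshold


def maf_histogram(snps, bin_percent: int = 5) -> dict:
    """Count SNPs per minor-allele-frequency bin (bin 0 = 0-5%, bin 1 = 5-10%, ...)."""
    counts = {}
    for snp in snps:
        # Integer arithmetic avoids floating-point surprises at bin borders.
        bin_index = (snp.minor_count * 100) // (snp.coverage * bin_percent)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return counts


# Made-up read counts, purely for demonstration.
example_snps = [
    BiallelicSnp(major_count=90, minor_count=10),   # inside window, clear major allele
    BiallelicSnp(major_count=52, minor_count=48),   # inside window, close to 1:1
    BiallelicSnp(major_count=30, minor_count=10),   # coverage below 50
    BiallelicSnp(major_count=450, minor_count=50),  # coverage above 400
]
window = [s for s in example_snps if in_coverage_window(s)]
histogram = maf_histogram(window)
likely_allelic = [s for s in window if passes_paralog_filter(s)]
```

In this sketch the histogram is built from every SNP inside the coverage window, and the paralog threshold is applied as a separate, optional step, mirroring the order in which the two ideas are described above.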
There is a higher proportion of SNPs for Q165 than for IJ76‐514 where the minor allele is at the rare end of the spectrum ( Figure 2 ), with more than 43% of SNPs having a minor allele frequency of 0.05–0.127 in Q165 compared with 36% in IJ76‐514. This higher proportion of rare SNPs was observed in an earlier study of SNPs in gene fragments in these same sugarcane genotypes ( Bundock , 2009 ). A possible explanation for this observation is that the S. spontaneum chromosomes of the Q165 genome would be likely to have many sequence differences from the S. officinarum chromosomes providing numerous SNPs on a minority of chromosomes. In addition, the number of chromosome copies is likely to be larger for Q165 than for IJ76‐514. This would lead to a larger proportion of SNPs with the minor allele being rare compared with IJ76‐514.

Figure 2. Histogram showing the percentage of SNPs with the corresponding minor allele frequency. Pertains to bi‐allelic SNPs found at a coverage threshold of 50–400 within genes from the (a) IJ76‐514 enriched library (b) Q165 enriched library.

The SureSelect approach has been highly successful with regard to the discovery of SNPs from the enriched reads aligning to genes. With an enrichment of 10–11‐fold, this method is economical, as the cost of library enrichment per library is much less than the cost required to produce the equivalent amount of sequence to cover the genic regions to an equivalent read depth. The method has allowed discovery of very large numbers of SNPs and should allow the mapping of a significant proportion of the genes in this complex genome. This strategy will enable high‐density SNP maps of the sugarcane genome to be constructed, which may be the key to successful assembly of a reference genome sequence for sugarcane ( Souza , 2011 ). As candidates for mapping, the most useful SNPs are single dose in one parent.
This occurs where one allele is represented on one chromosomal copy only and the alternative allele occurs on the other copies in the parent genome, with the second parent having only the major allele present on all chromosome copies. Ideally, the rare allele is absent from the other parent, and a single‐dose marker inherited from one parent can be developed, which can segregate 1 : 1 in the progeny and be fully informative for mapping (i.e. all progeny are informative). Selecting SNPs for assay development that have one allele at a low frequency is a prudent strategy for developing these single‐dose markers ( Figure 3 ). Ideally, the frequency would correspond to the reciprocal of the number of chromosomal copies, which for commercial sugarcane varieties is usually unknown. As the DNA sequences analysed here are publicly available, this now represents hundreds of thousands of sugarcane SNPs that are available for sugarcane researchers to use in mapping and association studies.

Figure 3. An alignment of reads from the SureSelect enriched library of Q165 to a region of sorghum chromosome 8 with SNPs highlighted. There is a C/A SNP at position 19 205 (see arrow) where the cytosine minor allele (C in blue) is at 8.6% frequency (34/360). This is a good candidate SNP for single‐dose marker development as the likely ratio of minor allele to major is 1:11, and thus, the SNP is likely to be present on one chromosome copy with 11 copies having the alternative allele. Two bases upstream at 19 207, there appears to be a difference between sugarcane (G in yellow) and the sorghum reference (A).

Design of probe sequences for hybridisation during capture

There were five different sources of probe sequence design for the baits used for capture during enrichment ( Table 6 ).
The most numerous were single probes designed to the coding region of predicted sorghum genes, with the next most numerous being probes designed to sugarcane expressed sequence tags (ESTs) that appeared not to be represented in the predicted sorghum genes. A third group of probes were designed to genomic sequence corresponding to selected fragments (202–445 bp in length) of 240 sugarcane genes included in a previous study. To compare the effectiveness of probe design, each probe sequence was used as a reference for read mapping. The GC content of probe designs was found to be highly significantly associated with the average read coverage for a probe. Probes with a GC content of around 30%–40% had the highest average read coverage ( Figure 4 ), a property that is probably directly related to DNA capture performance.

Table 6. Design of probes used for enrichment of sugarcane and sorghum genomic libraries

Target sequence                                               Number of probes   Number of times represented on manufacturing array   Design strategy
Predicted sorghum coding sequences (28 008 in total)          44 429             1                                                    Either a probe at both the 5′ and 3′ ends, or a probe at one end or mid sequence
Sugarcane ESTs                                                7263               1                                                    1 centrally located probe
Sucrose metabolism genes, genomic sequence, sugarcane         757                4                                                    2× tiling
Sucrose metabolism genes, genomic sequence, sorghum           204                4                                                    2× tiling
240 sugarcane gene fragments from 454 genomic DNA sequence    595                3                                                    1× tiling
Total                                                         53 248

Figure 4. Graph of the GC content versus the average coverage of probes (baits) with reads mapped from the Q165 enriched library. The mean of the average coverage across the 120mer probes for each 10% band (25%–35%, 35%–45% etc.) has been calculated. Error bars are ± 1 SE of the mean. It can be observed that high GC content probes tend to have reduced coverage compared to probes with 25%–45% GC content which are optimal.

For the probes designed to the sorghum coding regions, there were three locations used for design: extreme 5′ end, middle and extreme 3′ end.
An analysis was undertaken to determine whether probe position had any effect on the ability of probes to capture sequences during hybridisation. However, it was found that probes designed at the start of a predicted coding region (5′ end) had a significantly higher GC content ( P < 0.0001, ANOVA) than those designed to the middle or the 3′ end of the coding region ( Figure 5 ). A model that included probe position, percentage GC content of probe and their interaction was therefore fitted for the sorghum probes to explain the average coverage for each probe, and was used to analyse all three enriched libraries. For the enriched sorghum library, only GC content was found to be a statistically significant influence on the capture efficiency, and the position of the probe within the coding DNA sequence (CDS) and the interaction were not significant. For both of the sugarcane genotypes, all three terms were significant and probes designed at the extreme 5′ end had higher average coverage when GC content was taken into account than probes at the extreme 3′ end. Based on a smaller number of probes designed to the middle of sorghum coding regions, there was no significant difference in performance to probes at the 3′ end for either sugarcane genotype.

Figure 5. For probes designed to the coding region of predicted sorghum genes (the majority of probes), the region to which the probe was designed was found to significantly influence the GC content of probes on average, with 5′ end sequences leading to probes with higher GC content on average (a). These probes were also less effective on average at capturing sequence as judged by the number of reads that mapped to the probe sequences after sequencing enriched libraries. The case for sorghum is presented (b) with a similar result found for both the sugarcane enriched libraries. Error bars are 95% confidence intervals.
This analysis was carried out after removing the 1% of probes with the highest coverage to reduce undue influence of very highly repeated regions.

The likely explanation for these observations is that 5′ end probes tend to represent more conserved sequences than probes at the 3′ end, so that when there are sequence differences between probe and target, the capture of targets and/or read alignment during mapping is reduced. This is supported by an analysis of SNP distributions, which indicated a highly statistically significant difference in SNP abundance among the three locations, with the 3′ end having a much higher number of SNPs than expected and the 5′ end and middle having a smaller number of SNPs for all three enriched libraries (chi‐square test, P << 0.0001). Even though probes at the 5′ end had a higher coverage when GC content was taken into account, the GC content of 5′ end probes tends to be too high for the design of optimal probes (and they also miss out on capturing as much variation as 3′ end probes). The best recommendation therefore is to design a 3′ end probe as a matter of course because it is likely to have a GC content closer to optimum and also likely to encompass regions with variability. However, as insurance, if possible, a 5′ end probe could be designed to allow for those situations where there is too much variability at the 3′ end and there may be sufficient capture to enable alignment and detection of sequence variation. The caveat would be to design both probes, if possible, to a location where the GC content is below 55%, to reduce the number of poorly performing probes. If WGS sequence of the target species is available, it could be used during the probe design stage to screen for repetitive elements: WGS reads are mapped to the prospective probe sequences, and probes with high or very high coverage are culled as belonging to repetitive elements.
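The two screening recommendations above (GC content below 55%, and culling of probes with very high WGS read coverage) could be applied to candidate probes along the following lines. This is a hedged sketch rather than a description of the authors' workflow. The function names, the dictionary inputs and in particular the `max_wgs_coverage` cut-off of 100 are hypothetical, since no numerical coverage cut-off is given in the text.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a probe sequence."""
    if not sequence:
        raise ValueError("empty probe sequence")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def screen_probes(candidates, wgs_coverage, max_gc=0.55, max_wgs_coverage=100.0):
    """Return the candidate probes that pass both screens.

    candidates:    dict of probe name -> probe sequence
    wgs_coverage:  dict of probe name -> mean coverage of WGS reads mapped
                   to that probe (probes missing here are treated as 0)
    max_wgs_coverage is an illustrative value, not taken from the paper.
    """
    kept = {}
    for name, sequence in candidates.items():
        if gc_fraction(sequence) >= max_gc:
            continue  # GC content not below 55%
        if wgs_coverage.get(name, 0.0) > max_wgs_coverage:
            continue  # probably a repetitive element
        kept[name] = sequence
    return kept


# Made-up 120-base candidates for demonstration only.
candidates = {
    "probe_3prime": "AT" * 40 + "GC" * 20,  # GC about 33%
    "probe_5prime": "GC" * 45 + "AT" * 15,  # GC 75%, too high
    "probe_repeat": "ACGT" * 30,            # GC 50%, but very high WGS coverage
}
wgs_coverage = {"probe_3prime": 12.0, "probe_5prime": 9.0, "probe_repeat": 5000.0}
selected = screen_probes(candidates, wgs_coverage)
```

In practice the coverage cut-off would need to be chosen relative to the overall WGS sequencing depth of the species in question.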
As expected, the capture of sorghum sequences from the sorghum library with probes designed to the coding region of sorghum genes was more effective than the capture of sequences with probes designed to sugarcane ESTs and vice versa ( Figure 6 ). The same effect was observed for the probes tiled across sucrose metabolism genes: those designed to sorghum genes had a higher coverage in the sorghum enriched library than those designed to sugarcane and vice versa for the sugarcane libraries (data not shown). The probes designed to tile across genes or fragments had a higher average probe coverage than the non‐tiled probes although this effect is confounded with the additional number of times these probes were represented on the array for the manufacture of the baits.

Figure 6. The read coverage, after mapping reads to the probe sequences used for sequence capture for enrichment, is strongly influenced by the GC content of the probe sequences. After removing the effect of GC content of probes, there is a strong effect in the residuals of probe sequence origin (whether sorghum or sugarcane). For the enriched sorghum library, the read coverage is significantly higher for probes derived from sorghum coding DNA sequence (CDS) compared to sugarcane ESTs (a). For both sugarcane libraries, the opposite is the case: probes designed to sugarcane ESTs have higher coverage on average than those designed to sorghum CDS (b, the Q165 genotype). An almost identical graph to b was obtained for IJ76‐514. The error bars represent 95% confidence intervals for the means.

The sequencing of gene‐enriched genomic DNA complements other strategies that employ sequencing to discover genome variations. Amplicon sequencing has been shown to allow the analysis of variation in many loci in genotypes of interest ( Bundock , 2009 ; Kharabian‐Masouleh , 2011 ), providing efficient SNP discovery.
This more targeted SNP discovery approach is limited to much smaller numbers of loci than possible with the method reported here. Transcriptome sequencing ( Winfield , 2010 ) is useful but will require analysis of many different tissues and developmental stages to achieve the level of genome coverage possible with enriched genome sequencing.

In conclusion, we have demonstrated the application of a solution hybridisation capture‐based method targeted to enrich genes in the sugarcane genome based mainly on gene sequences found in the recently sequenced sorghum genome. The use of related genomes for design of probes might allow this approach to be widely applied in polymorphism discovery in poorly described species for which a closely related reference genome is available.

Experimental procedures

Design of probes for enrichment

Predicted coding regions from S. bicolor gene models were sourced from Phytozome v4.0. Sugarcane tentative consensus sequences (TCs), representing sugarcane transcripts, were downloaded from the Sugarcane Gene Index 2.2. Additional sugarcane sequences were retrieved from the Representative Public identifiers corresponding to the probe sets on the Sugar Cane Whole Genome Array. The 120 base sequences were extracted from each sequence using text functions in Microsoft Excel. Simple repeats and low complexity sequences within each 120‐base pair sequence were identified using RepeatMasker. Homology searching using the BLAST algorithm ( Altschul , 1990 ) was performed on local customised versions of the Sugarcane Gene Index 2.1, the Oryza sativa (rice) coding regions and the S. bicolor gene models, hosted at the CSIRO Bioinformatic Facility, Canberra, ACT, Australia. All 36 338 S. bicolor gene model coding regions (Sb_cds) were retrieved from Phytozome v4.0. The first 120 bases and last 120 bases (left_120mer and right_120mer, respectively) were extracted from each.
Each 120mer was assessed for simple repeats and low complexity as well as for the presence of Ns in the sequence. 33 152 of the left_120mers and 35 165 of the right_120mers did not contain either Ns or low complexity sequences. Identical oligos from highly related Sb_cds were removed from both sets. This resulted in the retention of 31 770 left_120mers and 33 829 right_120mers, which represented a total of 34 965 Sb_cds. After filtration of the results of BLASTn homology against the Sugarcane Gene Index 2.2 and the Oryza sativa (rice) coding regions, it was determined that 7936 of the S. bicolor gene models were not represented by a suitable 120mer. These gene models were targeted by homology searching their sequences against the Sugarcane Gene Index. 1080 of the gene models had alignments in excess of 120 bases long. Matching 120mer oligonucleotides were designed, commencing at the start of the matching sequence, reducing to 981 after filtration to remove 120mers containing Ns (Table S1). Two additional approaches were used to maximise the number of individual useful probes designed. The first involved using the Representative Public identifiers corresponding to the probe sets on the Affymetrix Sugar Cane Whole Genome Array (depleted for control and chloroplast‐related probe sets) to identify 7328 unique sugarcane sequences for further examination. These were homology searched against the S. bicolor gene models that had not yet been represented by a 120mer. Of the 770 matches returned, once alignment length and duplicate matches had been taken into account, 284 additional 120mers were designed, commencing at the start of the match to S. bicolor sequence. The second approach targeted the 27 728 singletons and 7263 TCs that were not homologous to the 120mers designed so far. An attempt was made to design 120mers to each of the sequences, commencing at base 101 to avoid possible poorer‐quality sequence at the extreme 5′ end of each sequence. 
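The 120mer extraction steps described in this section (first and last 120 bases of each coding region, a start at base 101 for sugarcane sequences, and rejection of probes containing Ns) were carried out by the authors with text functions in Microsoft Excel. The Python sketch below is only an equivalent illustration under that description. All function names are hypothetical, and the repeat and low‐complexity screening done with RepeatMasker is not reproduced here.

```python
PROBE_LENGTH = 120  # probe (bait) length in bases


def end_120mers(cds: str):
    """Return (left_120mer, right_120mer) for a coding sequence.

    Returns None when the sequence is shorter than one probe length.
    """
    if len(cds) < PROBE_LENGTH:
        return None
    return cds[:PROBE_LENGTH], cds[-PROBE_LENGTH:]


def offset_120mer(sequence: str, start_base: int = 101):
    """Return the 120mer commencing at a 1-based start position.

    Starting at base 101 skips the first 100 bases, where sequence
    quality may be poorer. Returns None when fewer than 120 bases remain.
    """
    start = start_base - 1  # convert to a 0-based index
    probe = sequence[start:start + PROBE_LENGTH]
    return probe if len(probe) == PROBE_LENGTH else None


def has_ambiguity(probe: str) -> bool:
    """True when the probe contains an unknown base (N)."""
    return "N" in probe.upper()


# Made-up sequences for demonstration only.
cds = "A" * 120 + "C" * 30 + "G" * 120
left_120mer, right_120mer = end_120mers(cds)
est_probe = offset_120mer("T" * 100 + "C" * 120)
too_short = offset_120mer("T" * 150)
```

A probe would then be retained only if `has_ambiguity` returns `False` and it also passes the separate repeat screen.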
Once the resulting oligonucleotides were filtered to remove sequences that were <120 bases long, 7134 new 120mers remained (Table S2). The number of 120‐base S. bicolor and sugarcane oligonucleotides designed and validated from all approaches described above was 51 847, reducing to 51 692 after a final check to detect oligonucleotides with inadvertent ambiguities.

A set of probes was also designed to tile across (two‐times tiling) eight genes for sucrose metabolism: the five genes for sucrose phosphate synthase I–V ( SPS I–V ), soluble acid invertase ( SAI ) and sucrose synthase 1 and 2 ( SuSy ). Two‐times tiling involves designing a set of probes that lie end to end and span the target sequence with a second set of probes offset (overlapping the first set) by 50% of the length of a probe (60 bases in this case) also lying across the length of the target. The genomic DNA sequence of four sugarcane SPS genes ( SPS II–V ) was used from sequences obtained at Southern Cross University. The sequence used for the design of probes for SPS I was obtained from sorghum along with SuSy1 . A set of 240 gene fragments amplified in an earlier experiment and sequenced using 454 technology ( Bundock , 2009 ) was also included for probe design using a two‐times tiling approach (Table S3). Table 6 shows the number of probes designed for each category, and the number of times each probe was represented on the array for manufacture, to create the bait sequences for hybridisation.

Preparation and sequencing of sugarcane genomic DNA libraries

Whole‐genome shotgun libraries from the genomic DNA of Q165 and IJ76‐514 were prepared using the Illumina DNA paired‐end library prep kit. The Covaris S2 was used to shear the genomic DNA before adapter ligation. Size selection was carried out on an Invitrogen E‐gel 2% with a target size of around 320 bp.
Analysis on an Agilent Bioanalyser indicated that the size range of the vast majority of fragments was from 290 to 410 bp for IJ76‐514, whilst for Q165 the size range was from 220 to 380 bp post‐PCR (i.e. including 120 bases of adapter sequence). Aliquots of WGS libraries for Q165, IJ76‐514 and Sorghum R931945‐2‐2 were used for the SureSelect procedure that was carried out according to the protocol from Agilent (SureSelect Target Enrichment System, Illumina Paired‐End Sequencing Platform Library Prep, Protocol Version 1.0, September 2009) and using the SureSelect reagents supplied by Agilent (Agilent Technologies Inc., Santa Clara, CA). Paired‐end sequencing was carried out using the Illumina GA IIx with 76‐base read length.

Analysis of Illumina sequence data

Illumina sequence data were analysed with CLC Genomics Workbench software, version 4 (CLC bio, Aarhus N, Denmark). Paired‐end read data with quality scores were quality trimmed using an error probability limit of 0.01, removing sequences shorter than 30 bases and sequence with unknown bases. For low stringency mapping of reads to reference sequences, the default parameters were used, which included length fraction = 0.5, similarity = 0.8 with mismatch cost = 2, insertion cost = 3, deletion cost = 3 and non‐specific matches handled by random assignment. High stringency read mapping was used for SNP detection and for comparative purposes with the length fraction = 0.9, similarity = 0.9 and non‐specific matches ignored (i.e. only unique matches mapped), otherwise the same as default parameters. For SNP detection, the trimmed sequences were depleted of reads matching repeated regions by mapping with default parameters to known repetitive regions. These were sorghum repeats (TIGR_Sorghum_Repeats.v3.0_0_0.fsa.txt), the sugarcane (SP80‐3280) chloroplast genome (NCBI Reference Sequence: NC_005878.2), the S. bicolor mitochondrial genome (GenBank: DQ984518.1) and an intron from our sugarcane sucrose phosphate synthase 3 ( SPS3 ) sequence that was found to be highly repetitive. Reads not matching to these repeat sequences were saved and mapped, using high stringency mapping parameters, to the ten chromosomes of the annotated S. bicolor genome sequence ( Paterson , 2009 ) that had previously been downloaded from NCBI and imported into CLC Workbench. SNP detection using CLC Bio Workbench was carried out on these mapped reads using the following stringent parameters: window length = 21, maximum gaps and mismatches = 2, minimum central base quality score = 30, minimum average quality score = 20, minimum coverage at SNP site = 20, minimum variant frequency = 5.0%, ploidy = 4, maximum coverage = 1000, sufficient variant count = 50 and required variant count = 2.

Acknowledgements

Genomic DNA samples of IJ76‐514 and Q165 were obtained from Dr Karen Aitken, CSIRO Plant Industry, Queensland Bioscience Precinct, St Lucia, Qld, Australia. Sorghum R931945‐2‐2 genomic DNA was supplied by Dr Emma Mace, Crop and Food Science, DEEDI Hermitage Research Facility, Warwick, Qld.

Journal

Plant Biotechnology Journal, Wiley

Published: Aug 1, 2012
