TY - JOUR AU - Willett, Christopher S AB - Abstract The formation of reproductive barriers between allopatric populations involves the accumulation of incompatibilities that lead to intrinsic postzygotic isolation. The evolution of these incompatibilities is usually explained by the Dobzhansky–Muller model, where epistatic interactions that arise within the diverging populations, lead to deleterious interactions when they come together in a hybrid genome. These incompatibilities can lead to hybrid inviability, killing individuals with certain genotypic combinations, and causing the population’s allele frequency to deviate from Mendelian expectations. Traditionally, hybrid inviability loci have been detected by genotyping individuals at different loci across the genome. However, this method becomes time consuming and expensive as the number of markers or individuals increases. Here, we test if a Pool-seq method can be used to scan the genome of F2 hybrids to detect genomic regions that are affected by hybrid inviability. We survey the genome of hybrids between 2 populations of the copepod Tigriopus californicus, and show that this method has enough power to detect even small changes in allele frequency caused by hybrid inviability. We show that allele frequency estimates in Pool-seq can be affected by the sampling of alleles from the pool of DNA during the library preparation and sequencing steps and that special considerations must be taken when aligning hybrid reads to a reference when the populations/species are divergent. hybrid incompatibility, postzygotic reproductive isolation, Tigriopus The formation of new species is a product of reproductive barriers that evolve between diverging populations, and can occur by a number of different mechanisms that decrease gene flow. One key form of reproductive isolation is intrinsic postzygotic isolation, under which the hybrids between 2 incipient species suffer from lower fitness due to partial or complete sterility or inviability. The evolution of these postzygotic barriers to gene flow is usually explained by the accumulation of deleterious epistatic genic incompatibilities between the hybridizing taxa, known as Dobzhansky–Muller (DM) incompatibilities (Dobzhansky 1936; Muller 1942). In the DM model, for 2 populations living in allopatry, different portions of their genomes may diverge due to natural selection or drift, forming novel alleles that have beneficial or neutral effects in the populations they evolved in. When these novel alleles (generally at different loci) come together in a hybrid genome, their interactions, which have never been tested in the same genetic background, can have deleterious effects leading to lowered hybrid fitness. Tigriopus californicus, an intertidal copepod that lives in upper intertidal pools on the west coast of North America, is an excellent system in which to study the early stages of intrinsic postzygotic isolation. Populations of these copepods are allopatric, with highly restricted gene flow, substantial levels of genetic divergence amongst them (Burton 1997; Willett and Ladner 2009), and low levels of polymorphism within them (Willett 2012; Pereira et al. 2016). Crosses between many of these populations have shown that first generation hybrids (F1) are usually equal in fitness, or even superior, to the parental populations, while second generation hybrids (F2) have on average lower fitness (Burton 1987; Edmands 1999) indicating that most of the incompatibilities are at least partially recessive. A common way to survey regions of the genome that are affected by hybrid inviability is to genotype individuals at different loci across the genome, and determine if the observed genotypic ratios deviate from expected ratios. However, designing markers becomes time consuming and expensive as the number of markers increase. Newer high-throughput sequencing methods, (e.g. whole genome sequencing, RADseq, genotype by sequencing) allow for many loci to be genotyped, but these methods are unfeasible if the organism being studied is too small to yield enough DNA for sequencing (as is the case for T. californicus), and costs can quickly increase as more individuals are sequenced. One way around this is to sequence the DNA from pooled individuals (Pool-seq [Burke et al. 2010; Futschik and Schlötterer 2010; Turner et al. 2011; Gautier et al. 2013; Schlötterer et al. 2014]), which allows you to sequence a large number of individuals at high coverage, with little effort. Here, we used a Pool-seq approach to sequence the genome of T. californicus F2 hybrids from a cross between 2 southern California populations, looking at deviations in allele frequency to determine regions of the genome that are affected by hybrid inviability. We were interested in determining whether this method is able to detect shifts in allele frequency caused by hybrid inviability in F2 hybrids, as even strong DMIs may not change allele frequencies much in a single generation. Our results show that this method has the power to detect even small changes in allele frequency and yields a comprehensive picture of the genome-wide effects of hybrid inviability. Material and Methods Population Sampling, Crossing Design, DNA Isolation, and Sequencing Tigriopus californicus were collected from intertidal rocky pools in Abalone Cove (AB, 33°44’ N, 118°22’ W) and San Diego (SD, 32°44’ N, 117°15’ W), CA. Animals were maintained in mass cultures in 400 mL beakers in artificial seawater at a salinity similar to natural saltwater concentrations (35 ppt, Instant Ocean, Aquarium Systems), and fed powdered commercial flake fish food as well as natural algae growth. Cultures were kept in incubators at 20 °C with a 12 h light:dark cycle. Males and females used in crosses were randomly sampled from culture beakers (and not from isofemale lines), so crosses between the two populations included some of the genetic diversity of both natural populations. Females from the SD population were crossed to males from the AB population in 24 well culture plates, with a single pair of copepods in each well. Virgin females were obtained by separating females from clasped pairs (Burton 1985), and their non-mated status was confirmed by monitoring them in individual wells over a week, at which point AB males were added to each well. F1 hybrids were separated into individual wells before they reached sexual maturity, to prevent siblings from mating with each other. F1 × F1 crosses were setup with a single pair per well again, and outcrossing was insured by crossing siblings from one cross to copepods from as many different crosses as possible. In both parental and F1 × F1 crosses, male fathers were removed from the cross as soon as nauplii were observed, while females were kept in the wells to allow them to produce multiple egg clutches. Given the small size of T. californicus, it was necessary to pool individuals to acquire enough DNA for high-throughput sequencing. We collected 300 red egg sacs (just before F2 nauplii hatched) and 300 adult F2 hybrids (150 males and females). Animals were kept frozen at -80oC until all individuals were collected. DNA was isolated using the Qiagen DNeasy blood and tissue kit, with the suggested modification for extraction from insects (Qiagen). One microgram of DNA from each pool (adult and nauplii) was submitted to the UNC High Throughput Sequencing Facility where sequencing libraries were prepared using the KAPA library preparation protocol (KAPA Biotechnology). Samples were sequenced as 100-bp paired-end libraries on the Illumina HiSeq 2000. One lane of sequencing per sample yielded 112.7 million reads for nauplii and 123.3 million reads for adults (~80X coverage). Illumina reads were trimmed for quality using PoPoolation (Kofler et al. 2011a) discarding bases with Phred quality scores lower than 25, and keeping reads of at least 50-bp after trimming. Generation of AB Consensus Reference Tigriopus californicus has a reference genome for the SD population, and we created an AB like reference from the mapping file (BAM) available online (https://i5k.nal.usda.gov/Tigriopus_californicus; v1.0). The consensus AB reference was extracted from the BAM file using the Samtools and Bcftools pipeline (Li et al. 2009; Li 2011). This consensus AB reference has all (or nearly all) of the AB SNPs, but the structure of the reference is still SD like. Since the AB reference is of “lower quality” (it has more “N”s), we made the SD reference comparable by adding “N”s to any position where the AB reference also had an “N” using custom Python scripts. A hybrid assembly was also created by combining the 2 populations’ references, so that each scaffold in the assembly had a SD and an AB version. SNP Database Between Parental Populations We identified SNPs between the parental populations by mapping their reads to the other population’s reference using BWA with default parameters; raw reads for each population were retrieved from GenBank (SD: SRX469409; AB: SRX2746703). PoPoolation2 (Kofler et al. 2011b) was used to find positions across the genome where all reads had an alternative nucleotide to that of the reference, considering only bialelic positions with coverage > 15X (SD mapped to AB) and > 20X (AB mapped to SD). We chose different lower coverage cutoff because the depth of coverage differed between the populations (median coverage: 34X for SD mapped to AB and 45X for AB mapped to SD). These two lists were then compared and SNPs were kept if, in reciprocal comparisons, the alternative base of one mapping was the reference base of the other. This procedure should yield only SNPs that are fixed (or nearly fixed) between the 2 populations. We will refer to this list as the parental SNP list. Hybrid Read Mapping and Determination of Allele Frequency Variance and Biases Sequence reads from F2 hybrid adults and nauplii were mapped to both parental genomes separately using BWA MEM with default parameters (Li and Durbin 2009), and to the hybrid reference genome using CLC Workbench (CW) with default match score and mismatch penalty parameters, using length fraction = 0.8 and similarity fraction = 0.99. In all cases, only reads that mapped with MAPQ score > 20 were retained. Total read counts for each SNP position, as well as population specific allele counts were determined using PoPoolation 2, keeping only biallelic positions with minimum coverage of 50X for the nauplii and 42X for adults. The resulting SNP list was compared to the parental SNP list, keeping only positions that occurred in both lists. We then compared the SNP lists between nauplii and adults. Keeping only SNP that occurred in both lists, so that all analyses were done on the same number of SNPs. We used the nauplii dataset to estimate the level variance in allele frequency estimation across the genome as well as any possible biases that result from mapping hybrid reads to each individual parental reference or the hybrid reference. Previous studies have shown that there is little to no deviations in the expected genotypic ratios of F2 individuals when they are surveyed shortly after hatching (Willett and Berkowitz 2007; Foley et al. 2013; Willett et al. 2016). Therefore, we expect allele frequencies to be close to 0.5 across the genome with low variance in this dataset. We looked at the distribution of allele frequencies for all SNPs and for different levels of smoothing by averaging over multiple SNPs. This smoothing of the data was done by averaging consecutive SNPs in nonoverlapping windows of 500, 1000, 2000, and 3000 SNPs. Determination of Genome-Wide Patterns of Hybrid Inviability Regions associated with hybrid inviability were determined by looking for portions of the genome where the observed allele frequency deviated from the expected allele frequency. An artifact of mapping hybrid reads to each parental genome is that, on average, allele frequencies will tend to be skewed towards the population to which they are mapped. This is a consequence of alleles of one population being more similar to their own population’s reference, and mapping with better alignment scores. This bias will cause the expected allele frequency for the hybrids (null expectation) to be different than 0.5 when mapped to either individual parental population. Given the results from our assessment of allele frequency variance and bias discussed above (also see Results section), the best way to minimize this bias was to average the allele frequencies for each SNP between the mappings to each parental reference. Therefore, all of the following analyses were done with allele frequencies calculated by averaging these values between mappings to each parental population. The change in allele frequency that is expected due to inviability in the F2 generation in pooled individuals is relatively small. For example, if 1 homozygous genotype is completely lethal in F2 hybrids, the allele frequency would change to either 0.333 or 0.667 (assuming the other genotypes are not selected against). This means that statistical tests that use count data (number of reads with SNP variant from each population) will have low power to detect deviations, unless coverage is very high. For example, with a high coverage of 100X, a chi-square test would only detect deviations from 0.5 when the allele frequency reached 0.4 or 0.6 (at P < 0.05). Such a skew would be the product of a very strong deviation in F2 hybrids, and we would ideally like to detect weaker deviations. Since we do not expect abrupt changes in allele frequency across proximate sequences within a chromosome by this generation, we can average the allele frequency of a large number of consecutive SNPs, which gives us accurate allele frequency estimates that change smoothly across each chromosome. We use these allele frequencies to calculate upper and lower allele frequency thresholds beyond which we consider regions of the chromosome to be skewed. We compared the distribution of allele frequencies within each chromosome between the nauplii and adult dataset using the Kolmogorov–Smirnov test (KS-test) to determine if allele frequencies from each chromosome between the datasets were drawn from the same distribution. Anchoring of Scaffolds to Chromosomes A newer reference genome for the SD population has recently become available (v2), so we used this reference to anchor and order the scaffolds from the reference assembly used in the present study (v1). Scaffolds from the v1 assembly were blasted to the v2 assembly and the subject start and end alignment positions (positions in the v2 assembly) were used to anchor and order the v1 scaffolds into the 12 T. californicus chromosomes. We performed this step to increase the percentage of the genome that is anchored to chromosomes, since in the v1 assembly only ~30% of the genome could be assigned to a chromosome. We can now assign 97% of the v1 reference length to chromosomes based on the v2 assembly (v1 assembly length: 184 Mbp; v2 assembly length: 191 Mbp). In the process of anchoring the v1 scaffolds it was determined that a few of these scaffolds were misassembled (this was also noted with a couple of the genotyping markers [see below]). Some of these misassembled scaffolds can be observed as sharp changes in allele frequency in the adult dataset, something that is not expected to happen for the F2 hybrid generation, since not enough recombination happens in two generations of hybridization. We removed these positions from our analysis, but it is apparent that a few other misassembled scaffolds exist and cause some noise in the data. We choose to ignore these at this point as they appear to have little to no effect in our analyses. Genotyping of Chromosome Markers and Candidate Hybrid Inviability Regions To verify the results obtained in the genome-wide comparison, we genotyped individual adult F2 hybrids for at least one marker per chromosome through PCR-based reactions with population specific fragment sizes. To do this, a second set of SD female × AB male crosses was setup in the same way as above, except that DNA was isolated from individual F2 hybrids using a proteinase-K cell lysis buffer method (as in [Willett and Berkowitz 2007]) rather than pooling individuals. An equal number of adult F2 males and females (174 total) were genotyped through PCR reactions with population specific primers (see Supplementary Table S1 for primer sequences), using an annealing temperature of 56 °C for all markers. We genotyped individuals at 14 loci, and deviations from expected 1:2:1 Mendelian genotypic ratios were determined through a chi-square test for each marker, with significance determined after a Bonferroni correction for multiple comparisons. Allele frequencies from the genotype data was also calculated and compared to the allele frequency from the Pool-seq data for their respective positions. In some cases, the primer location for the PCR markers happened to be in portion of a misassembled scaffold where there were not 3000 consecutive SNPs to calculate the allele frequency relative to that marker. Therefore, we present the data for all markers and for the subset of markers that are not in regions where scaffolds are misassembled. Results Assessment of Mapping Bias We used the nauplii dataset to estimate the level of mapping bias that results from mapping reads to each individual parental genome separately as well as to a hybrid reference. When reads were mapped to the hybrid reference, the median allele frequency was 0.476 in reference to the AB allele frequency (Table 1; Supplementary Figure S1a). This is not surprising as the AB reference contains AB SNP variants; but, structurally, it is still SD like. Therefore, SD alleles will map with higher scores on average, causing the allele frequency to be skewed towards higher SD allele frequency on average. A hybrid assembly method would be appropriate if there were 2 de novo reference assemblies of similar quality, but in this case, it appears to be no different than mapping reads to the SD reference (see below). As expected, when reads were mapped to the SD reference the median allele frequency was skewed from the expected 0.500 to 0.477 in reference to the AB allele frequency, while it was 0.519 when reads were mapped to the AB reference (Table 1; Supplementary Figure S1b and c). When we average the allele frequencies between mapping to SD and AB references, the distribution centers at 0.498 (Figure 1a). A similar pattern is observed for the F2 adult dataset when mapping to the individual populations’ reference (Table 1; Supplementary Figure S1d and e), with allele frequency centering at 0.505 after averaging between mappings (Figure 1b). We, therefore, chose to do all subsequent analysis by mapping hybrid reads to both parental references separately, and averaging allele frequencies between the two mappings. Table 1. Summary statistics for allele frequency and coverage distributions Allele frequency Coverage Number of SNPs Minimum Q1 Median Mean Q3 Maximum Minimum Q1 Median Mean Q3 Maximum Nauplii to hybrid reference 0.010 0.386 0.476 0.465 0.556 0.990 50.00 63.00 74.00 74.37 86.00 100.00 1956049 Nauplii to SD reference 0.008 0.406 0.477 0.472 0.544 0.987 50.00 72.00 85.00 85.79 100.00 124.00 2728082 Nauplii to AB reference 0.008 0.453 0.519 0.519 0.585 0.992 50.00 72.00 86.00 86.29 100.00 124.00 2745667 Nauplii average between references 0.013 0.434 0.498 0.496 0.560 0.982 50.00 73.00 85.50 86.05 99.00 124.00 1853450 Adults to SD reference 0.009 0.368 0.482 0.482 0.596 0.991 42.00 60.00 76.00 76.67 93.00 116.00 2271646 Adults to AB reference 0.009 0.417 0.528 0.528 0.640 0.991 42.00 60.00 76.00 76.94 93.00 116.00 2286768 Adults average between references 0.009 0.397 0.505 0.505 0.614 0.990 42.00 61.00 76.00 76.61 92.00 116.00 1853450 Adults subset of nauplii 0.475–0.525 0.013 0.402 0.508 0.508 0.615 0.990 — — — — — — 393994 Allele frequency Coverage Number of SNPs Minimum Q1 Median Mean Q3 Maximum Minimum Q1 Median Mean Q3 Maximum Nauplii to hybrid reference 0.010 0.386 0.476 0.465 0.556 0.990 50.00 63.00 74.00 74.37 86.00 100.00 1956049 Nauplii to SD reference 0.008 0.406 0.477 0.472 0.544 0.987 50.00 72.00 85.00 85.79 100.00 124.00 2728082 Nauplii to AB reference 0.008 0.453 0.519 0.519 0.585 0.992 50.00 72.00 86.00 86.29 100.00 124.00 2745667 Nauplii average between references 0.013 0.434 0.498 0.496 0.560 0.982 50.00 73.00 85.50 86.05 99.00 124.00 1853450 Adults to SD reference 0.009 0.368 0.482 0.482 0.596 0.991 42.00 60.00 76.00 76.67 93.00 116.00 2271646 Adults to AB reference 0.009 0.417 0.528 0.528 0.640 0.991 42.00 60.00 76.00 76.94 93.00 116.00 2286768 Adults average between references 0.009 0.397 0.505 0.505 0.614 0.990 42.00 61.00 76.00 76.61 92.00 116.00 1853450 Adults subset of nauplii 0.475–0.525 0.013 0.402 0.508 0.508 0.615 0.990 — — — — — — 393994 Each row refers to F2 nauplii or adult hybrid reads when mapped to different references, averaging allele frequencies between reference mappings, and the subset of adults SNPs with nauplii allele frequency between 0.475 and 0.525. Q1 and Q3 refer to the 1st and 3rd quartiles, respectively. View Large Table 1. Summary statistics for allele frequency and coverage distributions Allele frequency Coverage Number of SNPs Minimum Q1 Median Mean Q3 Maximum Minimum Q1 Median Mean Q3 Maximum Nauplii to hybrid reference 0.010 0.386 0.476 0.465 0.556 0.990 50.00 63.00 74.00 74.37 86.00 100.00 1956049 Nauplii to SD reference 0.008 0.406 0.477 0.472 0.544 0.987 50.00 72.00 85.00 85.79 100.00 124.00 2728082 Nauplii to AB reference 0.008 0.453 0.519 0.519 0.585 0.992 50.00 72.00 86.00 86.29 100.00 124.00 2745667 Nauplii average between references 0.013 0.434 0.498 0.496 0.560 0.982 50.00 73.00 85.50 86.05 99.00 124.00 1853450 Adults to SD reference 0.009 0.368 0.482 0.482 0.596 0.991 42.00 60.00 76.00 76.67 93.00 116.00 2271646 Adults to AB reference 0.009 0.417 0.528 0.528 0.640 0.991 42.00 60.00 76.00 76.94 93.00 116.00 2286768 Adults average between references 0.009 0.397 0.505 0.505 0.614 0.990 42.00 61.00 76.00 76.61 92.00 116.00 1853450 Adults subset of nauplii 0.475–0.525 0.013 0.402 0.508 0.508 0.615 0.990 — — — — — — 393994 Allele frequency Coverage Number of SNPs Minimum Q1 Median Mean Q3 Maximum Minimum Q1 Median Mean Q3 Maximum Nauplii to hybrid reference 0.010 0.386 0.476 0.465 0.556 0.990 50.00 63.00 74.00 74.37 86.00 100.00 1956049 Nauplii to SD reference 0.008 0.406 0.477 0.472 0.544 0.987 50.00 72.00 85.00 85.79 100.00 124.00 2728082 Nauplii to AB reference 0.008 0.453 0.519 0.519 0.585 0.992 50.00 72.00 86.00 86.29 100.00 124.00 2745667 Nauplii average between references 0.013 0.434 0.498 0.496 0.560 0.982 50.00 73.00 85.50 86.05 99.00 124.00 1853450 Adults to SD reference 0.009 0.368 0.482 0.482 0.596 0.991 42.00 60.00 76.00 76.67 93.00 116.00 2271646 Adults to AB reference 0.009 0.417 0.528 0.528 0.640 0.991 42.00 60.00 76.00 76.94 93.00 116.00 2286768 Adults average between references 0.009 0.397 0.505 0.505 0.614 0.990 42.00 61.00 76.00 76.61 92.00 116.00 1853450 Adults subset of nauplii 0.475–0.525 0.013 0.402 0.508 0.508 0.615 0.990 — — — — — — 393994 Each row refers to F2 nauplii or adult hybrid reads when mapped to different references, averaging allele frequencies between reference mappings, and the subset of adults SNPs with nauplii allele frequency between 0.475 and 0.525. Q1 and Q3 refer to the 1st and 3rd quartiles, respectively. View Large Figure 1. View largeDownload slide Allele frequency distributions for F2 nauplii and adult hybrids. Allele frequencies for all SNPs in the nauplii (a) and adult datasets (b); allele frequencies following a sliding window averaging of 3000 consecutive SNPs for nauplii (c) and adult (d), respectively. Figure 1. View largeDownload slide Allele frequency distributions for F2 nauplii and adult hybrids. Allele frequencies for all SNPs in the nauplii (a) and adult datasets (b); allele frequencies following a sliding window averaging of 3000 consecutive SNPs for nauplii (c) and adult (d), respectively. The mapping bias that is observed when mapping hybrid reads to the different references affects where the distribution is centered, but there is still a large amount of variation in allele frequency estimates going from nearly 0 to nearly 1. Although both the nauplii and adult datasets have these extreme allele frequency values, the distribution for the nauplii is tighter than for the adults. This is evidence that allele frequencies have higher variation across the genome for adults than for nauplii, which is expected as a consequence of hybrid inviability (Figure 1a and b). However, the distribution of allele frequency values is much larger that would be expected. Some of this variance is simply due to the sampling of alleles from the pool of DNA during library prep and sequencing, and allele frequency estimates should be affected by the coverage at each position. For example, with a coverage of 50X the 95% confidence interval for an expected 50:50 sampling ratio ranges from 35.53 to 64.47 and for 100X from 39.83 to 60.17 (Rohlf and Sokal 1969). We were also interested in determining if some of the variance in the allele frequency distribution was due to portions of the references (both AB and SD), favoring the alignment of reads from one of the populations. In other words, do certain SNPs consistently yield more accurate allele frequency estimates when different datasets are analyzed? To look at this, we selected SNPs with allele frequencies between 0.475 and 0.525 in the nauplii dataset, and compared them to the allele frequencies of the same SNPs in the adult dataset. If some of the variance in allele frequencies observed in the nauplii dataset were due to mapping biases favoring alleles from one of the populations in certain parts of the reference, excluding SNPs in these regions should also decrease the variance in the adult dataset. This was not the case, as the adult dataset had nearly identical distributions before and after the subsampling (Table 1; Supplementary Figure S1f). This indicates that the variance in allele frequency observed is not due to post sequencing processing and analysis, but likely as a consequence of sampling during library prep and sequencing as well as the depth of coverage. Allele Frequency Smoothing Next, we used nonoverlapping sliding window averaging of allele frequencies to decrease the variance in the allele frequency estimates. Any real changes in allele frequency observed at the F2 generation is expected to be smooth and impact large sections of chromosomes since there is only the opportunity for one generation of recombination in one sex (only males recombine in T. californicus) from the F1 to the F2 generation. We tested window sizes of 500, 1000, 2000, and 3000 SNPs. All window sizes show a similar pattern of allele frequency change, but smaller window sizes have more variance (Supplementary Table S2; Supplementary Figures S2 and S3). No information appears to be lost when using a window size of 3000 SNPs; and, as this window size gives us the smoothest data, we use it for the remainder of the analyses. This windows size yielded 607 windows, with an average of 294566 bp per window (STDEV: 70295; (Supplementary Table S2). Descriptive Analysis and Validation of Method The median depth of coverage was 85.5X for the nauplii dataset and 76.0X for the adults dataset. We considered SNPs with coverage greater than 50X for nauplii and 42X for the adults, keeping only those present in the parental SNP list and that also occurred on both nauplii and adult datasets for a total of 1853450 SNPs (Table 1). We compared the nauplii and adult datasets, as a way to determine if our method could detect changes in allele frequency due to hybrid inviability. Our results showed that no chromosome regions display strong deviations from expected neutral allele frequencies for the F2 nauplii, while several large regions of the genome were skewed in the adult dataset (Figure 2a and b). A KS-test suggests allele frequencies were drawn from different distributions for the full datasets (Figure 1c and d), as well as for all but one chromosome (c4), indicating allele frequencies in adults differ significantly from nauplii (Supplementary Table S3). Four chromosomes (c1, c3, c8, and c10) show strong skews, with allele frequencies shifting beyond 0.050 from the expected 0.500, while the other chromosomes have allele frequency shifts between 0.025 and 0.050 away from the null expectation. Figure 2. View largeDownload slide Allele frequency and genotype barplots plots for SD × AB F2 hybrids. Allele frequencies are based on the allele frequency of the AB alleles (y-axis). The x-axis indicates the relative position across each chromosome. Data points are the average allele frequency of 3000 consecutive SNPs; (a) F2 nauplii; (b) F2 adults. Arrows in SD x AB adults plot (b) indicate genomic positions that were genotyped for individual F2 hybrids; (c) Relative genotypic frequencies for SD × AB from genotyping of adult F2 hybrid individuals. Each bar represents a genotyped marker. Asterisks indicate markers for which the genotypic ratio deviates significantly from the expected 1:2:1 Mendelian ratio after Bonferroni correction (P < 0.003). Dots indicate deviations from this ratio at P < 0.05. Figure 2. View largeDownload slide Allele frequency and genotype barplots plots for SD × AB F2 hybrids. Allele frequencies are based on the allele frequency of the AB alleles (y-axis). The x-axis indicates the relative position across each chromosome. Data points are the average allele frequency of 3000 consecutive SNPs; (a) F2 nauplii; (b) F2 adults. Arrows in SD x AB adults plot (b) indicate genomic positions that were genotyped for individual F2 hybrids; (c) Relative genotypic frequencies for SD × AB from genotyping of adult F2 hybrid individuals. Each bar represents a genotyped marker. Asterisks indicate markers for which the genotypic ratio deviates significantly from the expected 1:2:1 Mendelian ratio after Bonferroni correction (P < 0.003). Dots indicate deviations from this ratio at P < 0.05. Next, we were interested in determining how the Pool-seq allele frequency estimates compared to allele frequencies derived from single-locus genotyping of individuals. We genotyped 174 F2 adult individuals of this same cross at 14 loci with at least one marker per chromosome, and calculated the allele frequency for each locus. In general, the genotyping estimates were very similar to the allele frequency in the Pool-seq data suggesting our allele frequencies estimates are likely very close to the actual allele frequencies for the sample (Figure 3; Supplementary Table S3). One of the purposes of the present study, was to identify levels of allele frequency change in F2 hybrids that can be confidently used as a threshold to consider allele frequencies significantly skewed without having to compare it to a dataset where skews are not expected (e.g. nauplii or a F1 hybrid dataset). Therefore, we also looked for deviations from expected 1:2:1 Mendelian genotypic ratios using a chi-square test in the genotyped individuals, as a way to determine levels of allele frequency change that were significantly skewed in both methods. Markers in five chromosomes showed skewed genotypic ratios, and four of these markers were in regions of chromosomes where the Pool-seq allele frequency was shifted by > 0.050 from the expected 0.500 (Figure 2b and c; Supplementary Table S3). For the one marker where that was not the case (in chromosome 5), both the AB and SD homozygous genotypic classes were underrepresented, but the allele frequency estimates were similar between the 2 methods (see Discussion). Given the results from the Pool-seq data combined with the individual genotyping data, it seems that allele frequency changes greater than 0.050 away from the expect 0.500 can be used as a threshold for detecting regions of the genome associated with hybrid inviability. Figure 3. View largeDownload slide Correlation between allele frequency estimates using the Pool-seq method and the genotyping data. R2 = 0.946 and slope = 1.018. PCR markers c1-c4L, and c11-septub map to small regions of misassembled scaffolds where there are not 3000 SNPs. As this may cause the allele frequency estimates to be less reliable, they were excluded from this comparison. See Supplementary Table S3 for all values. Figure 3. View largeDownload slide Correlation between allele frequency estimates using the Pool-seq method and the genotyping data. R2 = 0.946 and slope = 1.018. PCR markers c1-c4L, and c11-septub map to small regions of misassembled scaffolds where there are not 3000 SNPs. As this may cause the allele frequency estimates to be less reliable, they were excluded from this comparison. See Supplementary Table S3 for all values. Discussion In the present study, we tested if a Pool-seq approach could be used to detect genomic regions that are affected by hybrid inviability. Our results show that this method has enough power to detect even small deviations in allele frequency caused by hybrid inviability. Comparisons between nauplii and adult F2 hybrid pools from the SD × AB cross indicates there is little to no deviation from expected Mendelian ratios in nauplii (Figure 2a), as has been previously observed in this and other crosses for most markers (Willett and Berkowitz 2007; Foley et al. 2013). However, there is a small difference in the average allele frequency across the entire genome for nauplii (0.498) versus adults (0.505), suggesting that there may be small deviations in nauplii when they hatch, which too have been observed before (Pritchard et al. 2011; Willett et al. 2016). Most importantly, this comparison between nauplii and adults shows that our method is able to detect changes in allele frequencies that are due to hybrid inviability, most of which occurs posthatching. Compared with previous studies that looked at this same cross, it is clear that some of the strongest allele frequency deviations, on chromosomes 1, 3, and 10 are always observed (Burton 1987; Willett 2011; Willett et al. 2016). The effects of hybrid inviability on chromosome 3 are particularly interesting, as it is lethal to 90% of individuals that are homozygous for the SD allele; a pattern that has been consistently observed since it was first noticed by Burton (1987) using an allozyme marker. Other loci show more variable patterns when the same cross is setup at different times (e.g. chromosomes 2 and 8). Willett et al. (2016) observed selection against SD homozygotes for a marker on chromosome 2, whereas the present study shows selection against AB homozygotes in the genotyping data in agreement with low AB allele frequency in the Pool-seq data. Chromosome 8 shows small differences between the Pool-seq and genotyping data in the present study, and from other previous studies (Willett 2011; Willett et al. 2016). This variability in the expression of incompatibilities could be caused by polymorphism in the factors that cause the incompatibility, or even due to variations in the conditions each time the crosses are setup (e.g. diet and population density in which the hybrids are raised [Willett 2008]). There appears to be little concordance between chromosomal patterns of inviability for the SD × AB cross presented here and a cross between SD and a population from Santa Cruz, CA (SC [Pritchard et al. 2011; Foley et al. 2013]). Only chromosome 1 shows selection against SD homozygotes in both cases. Chromosome 10 also shows skewed genotypic ratios in Foley et al (2013), but it is sex dependent with selection against SD homozygotes in males, and SC homozygotes in females. This is not the case in the present study, where a strong selection against AB alleles is observed in both the Pool-seq and genotyping datasets, which have an even sex ratio (Figure 2). One of our PCR markers illustrates a weakness of the Pool-seq allele frequency method: the chromosome 5 marker has skewed genotypic ratios, with less than half of the expected number of homozygous AB individuals, but there is also a deficit of SD homozygous individuals. The allele frequency in the Pool-seq approach is at 0.457, and, though there is a noticeable shift in allele frequency between adults and nauplii, it does not detect the strong level of selection against AB homozygotes. The Pool-seq method will have low power when both homozygotes are selected against (i.e. there are excess heterozygotes). This is because excess heterozygotes will maintain the allele frequency close to 0.500. These results combined suggest our method has enough power to detect skews of relatively small magnitude in allele frequency, but will have low power when both homozygous classes are selected against. We assessed if previously published studies that genotyped individuals could have benefited from this Pool-seq method (Gardiner et al. 1993 [used RFLP markers in Maize]; Harushima et al. 2001 [RFLP markers in rice]; Hall and Willis 2005 [AFLP, microsatellites and gene-based markers]). By transforming genotypic frequencies from these studies into allele frequencies, we compared the number of markers that were identified as showing segregation distortion that also showed allele frequencies that were skewed above 0.55 or below 0.45. The allele frequency method was able to detect the same skewed markers as the genotyping data in most cases, failing only in cases where both homozygous classes were selected against or overrepresented in relation to heterozygotes (as was the case for the present study; Supplementary Table S4). In a few cases, significant deviations were detected in these studies that translate to allele frequency changes to > 0.525 or <0.475. This may be a less conservative threshold that could be considered, assuming it is a consistent shift in allele frequency across most of the chromosome. Since a large number of consecutive SNPs can be averaged in F2 hybrids using our Pool-seq method, due to the small amount of recombination that occurs up to that generation, the allele frequency that is observed over large window sizes should be a good estimate of the true allele frequency for the sample sequenced. Therefore, even small changes are likely to be real to that specific sample, and it becomes a matter of determining what level of allele frequency change is relevant in each study. The Pool-seq method allows for determination of factors affecting viability in genome-wide scale, but it does not allow for the determination of which factors are interacting with each other to cause these incompatibilities, as it is not possible to determine whether there is linkage disequilibrium between physically unlinked factors. This method should be sensitive enough to detect allele frequency changes due to a simple 2-factor recessive–recessive incompatibility, which would cause allele frequency to change to approximately 0.55 and 0.45. However, more complex incompatibilities involving fully recessive factors would be harder to detect, as is the case when individuals are genotyped as well, since the proportion of individuals that have the deleterious combination of alleles across multiple loci is lowered. The Pool-seq method would however have good power to detect dominant–recessive incompatibilities (at least for the recessive factor), and maternal or paternal effects where all homozygous individuals from one genotypic class may be inviable (Beeman et al. 1992; Lorenzen et al. 2008; Seidel et al. 2008, 2011). Although genotyping individuals allows for the detection of genotypic classes that deviate from Mendelian ratios, the Pool-seq allele frequency method allows you to scan the genome of a very large number of individuals, at a much higher density of markers with very little effort, and increasingly lower costs. For example, one lane of Illumina HiSeq 4000 could produce enough sequence for four datasets with similar coverage to the ones presented here. Deviations from expected Mendelian genotypic ratios (1:2:1) agreed with deviations from Mendelian allele frequency ratios (1:1) at most markers. Some discrepancies appeared when we transformed the genotype data to allele frequencies, but in many cases these were due to problems with the reference genome being misassembled (Figure 3). It is also possible that certain incompatibilities, especially those of weak effect are not always expressed when the same populations are crossed (Willett 2008; Willett et al. 2016 and the present study). Even though chromosomes with skewed allele frequencies tend to be skewed for the entire length of the chromosome, there are “peaks” of allele frequency that may indicate the location of the incompatibility (or incompatibilities) that are causing this skew. We are currently working on identifying the location of the incompatibility that leads to the extreme allele frequency skew in chromosome 3, and the location of the incompatibility appears to match the “peak” in the allele frequency plot for that chromosome (unpublished). Therefore, it seems there may be enough signal in these plots to give us hints as to where actual incompatibilities may be within a chromosome. This type of signal would not be observed with low-resolution genotyping data. The Pool-seq method gives us a more continuous view of how allele frequency changes across the chromosomes, and may be very helpful when choosing candidate regions that can be further studied to hone in on actual incompatibilities. Supplementary Material Supplementary data are available at Journal of Heredity online. Funding This work was supported by the National Science Foundation (grant IOS-1155325 and DEB-0821003 to C.S.W). T.G.L. was supported by National Science Foundation Postdoctoral Research Fellowship in Biology (award no. 1523543) during the preparation of this manuscript. Data Availability Raw Illumina reads were deposited in NCBI SRA (SRR6391912; SRR6391913); genotyping and allele frequency data can be accessed through Dryad (10.5061/dryad.r3v52). Acknowledgments We thank R. Burton for lab space during a portion of this work. We thank M. Servedio, M. Noor, C. Jones and T. Vision for helpful discussion in the earlier stages of the study and manuscript preparation. We thank T. Pierce for writing the custom python script to equalize the quality of the genome references. References Beeman RW , Friesen KS , Denell RE . 1992 . Maternal-effect selfish genes in flour beetles . Science . 256 : 89 – 92 . Google Scholar CrossRef Search ADS PubMed Burke MK , Dunham JP , Shahrestani P , Thornton KR , Rose MR , Long AD . 2010 . Genome-wide analysis of a long-term evolution experiment with Drosophila . Nature . 467 : 587 – 590 . Google Scholar CrossRef Search ADS PubMed Burton , RS . 1985 . Mating system of the intertidal copepod Tigriopus californicus . Mar. Biol . 86 : 247 – 252 . Google Scholar CrossRef Search ADS Burton RS . 1987 . Differentiation and integration of the genome in populations of the marine copepod Tigriopus californicus . Evolution . 41 : 504 – 513 . Google Scholar CrossRef Search ADS PubMed Burton RS . 1997 . Genetic evidence for long term persistence of marine invertebrate populations in an ephemeral environment . Evolution . 51 : 993 – 998 . Google Scholar CrossRef Search ADS PubMed Dobzhansky T . 1936 . Studies on hybrid sterility. II. localization of sterility factors in Drosophila pseudoobscura hybrids . Genetics . 21 : 113 – 135 . Google Scholar PubMed Edmands S . 1999 . Heterosis and outbreeding depression in interpopulation crosses spanning a wide range of divergence . Evolution . 53 : 1757 – 1768 . Google Scholar CrossRef Search ADS PubMed Foley BR , Rose CG , Rundle DE , Leong W , Edmands S . 2013 . Postzygotic isolation involves strong mitochondrial and sex-specific effects in Tigriopus californicus, a species lacking heteromorphic sex chromosomes . Heredity (Edinb) . 111 : 391 – 401 . Google Scholar CrossRef Search ADS PubMed Futschik A , Schlötterer C . 2010 . The next generation of molecular markers from massively parallel sequencing of pooled DNA samples . Genetics . 186 : 207 – 218 . Google Scholar CrossRef Search ADS PubMed Gardiner JM , Coe EH , Melia-Hancock S , Hoisington DA , Chao S . 1993 . Development of a core RFLP map in maize using an immortalized F2 population . Genetics . 134 : 917 – 930 . Google Scholar PubMed Gautier M , Foucaud J , Gharbi K , Cézard T , Galan M , Loiseau A , Thomson M , Pudlo P , Kerdelhué C , Estoup A . 2013 . Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping . Mol. Ecol . 22 : 3766 – 3779 . Google Scholar CrossRef Search ADS PubMed Hall MC , Willis JH . 2005 . Transmission ratio distortion in intraspecific hybrids of Mimulus guttatus: implications for genomic divergence . Genetics . 170 : 375 – 386 . Google Scholar CrossRef Search ADS PubMed Harushima Y , Nakagahra M , Yano M , Sasaki T , Kurata N . 2001 . A genome-wide survey of reproductive barriers in an intraspecific hybrid . Genetics . 159 : 883 – 892 . Google Scholar PubMed Kofler R , Orozco-terWengel P , De Maio N , Pandey RV , Nolte V , Futschik A , Kosiol C , Schlötterer C . 2011a . PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals . PLoS One . 6 : e15925 . Google Scholar CrossRef Search ADS Kofler R , Pandey RV , Schlötterer C . 2011b . PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq) . Bioinformatics . 27 : 3435 – 3436 . Google Scholar CrossRef Search ADS Li H . 2011 . A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data . Bioinformatics . 27 : 2987 – 2993 . Google Scholar CrossRef Search ADS PubMed Li H , Durbin R . 2009 . Fast and accurate short read alignment with Burrows-Wheeler transform . Bioinformatics . 25 : 1754 – 1760 . Google Scholar CrossRef Search ADS PubMed Li H , Handsaker B , Wysoker A , Fennell T , Ruan J , Homer N , Marth G , Abecasis G , Durbin R ; 1000 Genome Project Data Processing Subgroup . 2009 . The Sequence Alignment/Map format and SAMtools . Bioinformatics . 25 : 2078 – 2079 . Google Scholar CrossRef Search ADS PubMed Lorenzen MD , Gnirke A , Margolis J , Garnes J , Campbell M , Stuart JJ , Aggarwal R , Richards S , Park Y , Beeman RW . 2008 . The maternal-effect, selfish genetic element Medea is associated with a composite Tc1 transposon . Proc. Natl. Acad. Sci. USA . 105 : 10085 – 10089 . Google Scholar CrossRef Search ADS Muller , HJ . 1942 . Isolating mechanisms, evolution and temperature . Biol. Symp . 6 : 71 – 125 . Pereira RJ , Barreto FS , Pierce NT , Carneiro M , Burton RS . 2016 . Transcriptome-wide patterns of divergence during allopatric evolution . Mol. Ecol . 25 : 1478 – 1493 . Google Scholar CrossRef Search ADS PubMed Pritchard VL , Dimond L , Harrison JS , S Velázquez CC , Zieba JT , Burton RS , Edmands S . 2011 . Interpopulation hybridization results in widespread viability selection across the genome in Tigriopus californicus . BMC Genet . 12 : 54 . Google Scholar CrossRef Search ADS PubMed Rohlf , JF , Sokal RR . 1969 . Statistical tables . San Francisco: W. H. Freeman and Company . Schlötterer C , Tobler R , Kofler R , Nolte V . 2014 . Sequencing pools of individuals - mining genome-wide polymorphism data without big funding . Nat. Rev. Genet . 15 : 749 – 763 . Google Scholar CrossRef Search ADS PubMed Seidel HS , Ailion M , Li J , van Oudenaarden A , Rockman MV , Kruglyak L . 2011 . A novel sperm-delivered toxin causes late-stage embryo lethality and transmission ratio distortion in C. elegans . PLoS Biol . 9 : e1001115 . Google Scholar CrossRef Search ADS PubMed Seidel HS , Rockman MV , Kruglyak L . 2008 . Widespread genetic incompatibility in C. elegans maintained by balancing selection . Science . 319 : 589 – 594 . Google Scholar CrossRef Search ADS PubMed Turner TL , Stewart AD , Fields AT , Rice WR , Tarone AM . 2011 . Population-based resequencing of experimentally evolved populations reveals the genetic basis of body size variation in Drosophila melanogaster . PLoS Genet . 7 : e1001336 . Google Scholar CrossRef Search ADS PubMed Willett CS . 2008 . Significant variation for fitness impacts of ETS loci in hybrids between populations of Tigriopus californicus . J. Hered . 99 : 56 – 65 . Google Scholar CrossRef Search ADS PubMed Willett CS . 2011 . Complex deleterious interactions associated with malic enzyme may contribute to reproductive isolation in the copepod Tigriopus californicus . PLoS One . 6 : e21177 . Google Scholar CrossRef Search ADS PubMed Willett CS . 2012 . Quantifying the elevation of mitochondrial DNA evolutionary substitution rates over nuclear rates in the intertidal copepod Tigriopus californicus . J. Mol. Evol . 74 : 310 – 318 . Google Scholar CrossRef Search ADS PubMed Willett CS , Berkowitz JN . 2007 . Viability effects and not meoitic drive cause dramatic departures from Mendelian inheritance for malic enzyme in hybrids of Tigriopus californicus populations . J. Evol. Biol . 20 : 1196 – 1205 . Google Scholar CrossRef Search ADS PubMed Willett CS , Ladner JT . 2009 . Investigations of fine-scale phylogeography in Tigriopus californicus reveal historical patterns of population divergence . BMC Evol. Biol . 9 : 139 . Google Scholar CrossRef Search ADS PubMed Willett CS , Lima TG , Kovaleva I , Hatfield L . 2016 . Chromosome-Wide Impacts on the Expression of Incompatibilities in Hybrids of Tigriopus californicus . G3 (Bethesda) . 6 : 1739 – 1749 . Google Scholar CrossRef Search ADS PubMed © The American Genetic Association 2018. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) TI - Using Pool-seq to Search for Genomic Regions Affected by Hybrid Inviability in the copepod T. californicus JF - Journal of Heredity DO - 10.1093/jhered/esx115 DA - 2018-01-10 UR - https://www.deepdyve.com/lp/oxford-university-press/using-pool-seq-to-search-for-genomic-regions-affected-by-hybrid-YY18jagdXn SP - 1 EP - 476 VL - Advance Article IS - 4 DP - DeepDyve ER -