Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

De Novo Mutations Resolve Disease Transmission Pathways in Clonal Malaria

De Novo Mutations Resolve Disease Transmission Pathways in Clonal Malaria Detecting de novo mutations in viral and bacterial pathogens enables researchers to reconstruct detailed networks of disease transmission and is a key technique in genomic epidemiology. However, these techniques have not yet been applied to the malaria parasite, Plasmodium falciparum, in which a larger genome, slower generation times, and a complex life cycle make them difficult to implement. Here, we demonstrate the viability of de novo mutation studies in P. falciparum for the first time. Using a combination of sequencing, library preparation, and genotyping methods that have been optimized for accuracy in low-complexity genomic regions, we have detected de novo mutations that distinguish nominally identical parasites from clonal lineages. Despite its slower evolutionary rate compared with bacterial or viral species, de novo mutation can be detected in P. falciparum across timescales of just 1–2 years and evolutionary rates in low-complexity regions of the genome can be up to twice that detected in the rest of the genome. The increased mutation rate allows the identification of separate clade expansions that cannot be found using previous genomic epidemiology approaches and could be a crucial tool for mapping residual transmission patterns in disease elimination campaigns and reintroduction scenarios. Key words: genomic epidemiology, de novo mutation, malaria, transmission networks. Despite the relevance of all these scenarios to malaria Introduction transmission and the potential impact on elimination efforts, The reconstruction of transmission networks by sequencing inference of transmission networks based on de novo varia- pathogen samples is a powerful technique that has become tion has not been attempted. Genomics-based transmission increasingly routine in the field of genomic epidemiology. By analyses in malaria has have thus far relied on standing var- identifying infections with shared de novo variation or the iation (Nkhoma et al. 2013; Daniels et al. 2015). However, shortest genetic distances between samples, epidemiologists these approaches require the reassortment of common alleles are able to reconstruct detailed transmission networks in viral via sexual outcrossing and will not show discriminatory and bacterial pathogens such as flu (Neher and Bedford 2015), power in regions where transmission rates are too low (in MRSA (Harris et al. 2010; Popovich et al. 2016)orEbola (Gire which the parasite predominantly mates with itself) or in et al. 2014; Park et al. 2015). Assaying de novo mutations has situations where malaria has been reintroduced to a region proven to be a useful strategy in settings where epidemiolog- (resulting in an entirely clonal outbreak). Indeed, even within ical data is limited or in situations where the actual transmis- regions with ongoing outcrossing, clonal outbreaks persisting sion pathways are difficult to identify, and it is applicable to across multiple years have been found (Daniels et al. 2015); as any measurably evolving population (as defined by Drummond disease-endemic countries approach elimination these situa- et al. 2003). Using de novo genetic variation to track pathogen tions are likely to become more common. population movement has enabled the identification of dis- Several challenges prevent the direct application of trans- ease reservoirs underlying reemergence (Eyre et al. 2013); mission network reconstruction methodologies in assessed the relative contribution of local versus imported Plasmodium. In addition to complications introduced by sex- infections in a given setting (Azarian et al. 2016); and allowed ual outcrossing, the larger genome of P. falciparum and a slow the identification of the rate and times of transmission across mutation rate relative to prokaryotic or viral pathogens national borders (Faria et al. 2017; Metsky et al. 2017). reduces our ability to accurately resolve de novo mutations The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/ licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is Open Access properly cited. 1678 Mol. Biol. Evol. 35(7):1678–1689 doi:10.1093/molbev/msy059 Advance Access publication May 1, 2018 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE between individuals. The most rapidly mutating loci are clones, each with an associated genome assembly. These found within low complexity sequence, and commonly techniques were then applied to patient samples collected within short tandem repeats (STRs). Mutation rates for from a single clinic in Thie `s, Senegal representing three sep- STRs can be as high as 3.77 10 mutations/locus/genera- arate clonal expansions circulating in the region. These clonal tion (Chenet et al. 2015), many orders of magnitude higher samples were genotyped using a 24 SNP barcode (Daniels than the estimated rate for SNPs in non-STR sequence (1.07 et al. 2015) and were previously determined to be identical 10 substitutions/locus/generation [Lemieux 2015]). STR by fixed locus genotyping of commonly genotyped polymor- mutations are themselves associated with increased mutation phic genes (MSP1/MSP2/TRAP/CSP;A. Bei, personal rates in other organisms (Lenz et al. 2014) and low-complexity communication). sequences in P. falciparum show greatly increased evolution- 250-bp Illumina Reads Increase Genotype Specificity ary rates compared with high-complexity flanking sequences in Low-Complexity Regions (Zilversmit et al. 2010). These highly mutable, low complexity regions within the P. Genome Accessibility falciparum genome are difficult to access using conventional The effect of read length and calling algorithm on genome short-read sequencing approaches. Commonly used sequenc- accessibility was assessed by comparison of DISCOVAR ing formats (100 bp paired-end reads within 300 bp frag- (which is restricted to 250 bp reads) and GATK ments) prohibit high quality alignments in repetitive HaplotypeCaller with both 100 bp (hereafter HC100) and sequence, particularly when the size of the repeat region 250 bp reads (HC250). To assess genome accessibility we gen- approaches or exceeds the read length. Where insertion/de- erated an in silico library from one genome assembly (Dd2) letion (INDEL) mutations are found within STRs they can be aligned this to another referencegenomeassembly (3D7)and impossible to detect, as insufficient nonrepetitive flanking assessed our ability to accurately reconstruct the second ref- sequence around the variant can lead to incorrect partial erence genome. A greater proportion of the P. falciparum alignments or a complete failure to map reads (Narzisi and genome was found to be “accessible” to DISCOVAR Schatz 2015). (78.4%) and HC250 (79.3%), than was accessible to HC100 By spanning long repeats, longer read lengths will greatly (64%), indicating the importance of longer reads. This differ- improve our ability to analyze variation in low complexity ence in accessibility represents >3.5 Mb: the majority of the genomic regions (including SNPs, STR modifications and genome outside of the mitotically recombining telomere and non-STR associated INDELs) and may, along with associated subtelomere sequence and intercalary heterochromatin algorithmic advances in variant calling, enable studies of de (Flueck et al. 2009;see fig. 1). As expected, low-complexity novo variation in eukaryotes (Weisenfeld et al. 2014). regions also show a marked improvement in accessibility; of 10.8 Mb of sequence identified as low-complexity by Applying such an approach to clinical samples representing clonal expansions in Senegal, we show that de novo mutation DustMasker (Morgulis et al. 2006), 80% is accessible using can be reliably called in a eukaryotic parasite at a specificity either DISCOVAR or HC250 compared with 65% accessible that makes it amenable for determining phylogenetic rela- using HC100. tionships. We further demonstrate that P. falciparum can be considered a measurably evolving population (i.e., a popula- Marker Validation tion that can be sampled at different points in time generat- Following the in silico analysis of genome accessibility we ing a statistically significant number of genetic differences— assessed specificity and sensitivity of all three genotyping Drummond et al. 2003)and is thereforeasuitablesubject for approaches by comparing a sequenced DD2-2D4 sample to transmission network reconstruction. The wider use of these a validated set of genotype calls from the Pf-Crosses variant techniques would greatly increase the resolution of genomic set (Miles et al. 2016; Pf-Crosses data available at https://www. epidemiology studies in malaria and could have a dramatic malariagen.net/data/pf-crosses-1.0). Receiver-Operator impact on efforts to eliminate the disease. Characteristic (ROC) curves were generated for variants in regions that were accessible to all callers (the “core” genome) Results and variants found in the additional genome accessible to To increase our ability to detect rapidly arising mutations in that caller (the “extended” genome; supplementary fig. S1, the P. falciparum genome, a modified library preparation Supplementary Material online]. Specificity rates were similar method was employed in which large volumes of DNA across all callers, with DISCOVAR being the most specific, were prepared without a PCR step (minimizing the introduc- albeit at a significant cost to sensitivity (DISCOVAR precision tion of amplification errors); 250 bp Illumina reads were gen- 0.34/sensitivity 0.39; HC250 0.28/0.87; HC100 0.31/0.88); these erated on a size-selected 450 bp fragment (ensuring results were similar when we included the extended genome overlapping reads that could be joined into one long frag- (DISCOVAR 0.33/0.38; HC250 0.31/0.87; HC100 0.28/0.86) ment); these reads were genotyped with DISCOVAR, an al- suggesting that there is little drop in specificity in low- gorithm designed specifically for this data type (Weisenfeld complexity regions of the genome. It is notable that the et al. 2014), as well as GATK HaplotypeCaller (DePristo et al. ROC curves for HC250 and DISCOVAR show “false positive” 2011). The extent and accuracy of these techniques were variants even at the highest VQSLOD scores and in all regions assessed using laboratory isolates of well-defined parasite of the genome, which likely represent real variants that were 1679 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE a) b) c) 1% 9% 18% 1% 7% 1% 10 4% 6 0 59% DISCO HC250 HC100 DISCO HC100+HC250 HC250 DISCO+HC250 HC100 DISCO+HC100 INACCESSIBLE ACCESSIBLE INACC.(TELOMERE) 0e+00 1e+06 2e+06 3e+06 position (bp) FIG.1. Genome accessibility was assessed based on in silico comparisons of two P. falciparum genome assemblies. Variants were called from reads simulated from the Pf_Dd2 reference and aligned to the Pf3D7_v3 reference and assessed as correct or incorrect based on a 200 bp flanking region on either side. Blocks of 1 kb across the genome were defined as accessible only if they had a read coverage within two standard deviations of the median and if they had called all variants accurately. Comparisons were made between DISCOVAR and HaplotypeCaller with both 250 bp (HC250) and 100 bp reads (HC100). Subplots show (a) the genome-wide distribution of accessible blocks, (b) the proportions accessible to combinations of callers, and (c) the overall genome area accessible to each caller. The strong overlap of HC250 and DISCOVAR calls suggests that the majority of this improvement derives from the use of longer reads; accessibility increased from 64.0% with HC100 to 78.4% with DISCOVAR and 79.3% with HC250. 10.8 Mb of low complexity sequence was identified by DustMasker, of which 80.8% 81.4%, and 65.4% was found to be accessible by DISCOVAR, HC250, and HC100, respectively. not callable with the 100 bp reads that were used for the Pf- tree, we compared the proportion of variants that supported crosses study. each bipartition, contradicted it (requiring a homoplasy, re- version or recombination to explain the split), or were irrel- Higher Specificity Genotyping Produces Robust evant (e.g., a variant contained entirely within one clade). The Phylogenetic Inferences mean proportion of variants that support the phylogeny of We prioritized specificity for subsequent analyses given the the largest clade was 0.75. This result is consistent with a high motivating application for these data was distance-based specificity and lower sensitivity genotyping method and does transmission network reconstruction methods, which are not suggest any recombination has taken place in these sam- sensitive to false positive calls. Using the combination of ples. Despite the relatively short sampling times covered (only 250 bp reads and DISCOVAR variant calls, we genotyped 6 years in the largest clade) bootstrap support for the samples from a clinical isolate collection from Senegal, repre- DISCOVAR phylogeny is well above 50% for most within- senting three clades (24, 26, and 29—Daniels et al. 2015). clade relationships (mean bootstrap value 0.71); samples Samples within these clades were indistinguishable by previ- with low bootstrap support were separated by no more ous techniques and represented independent infections col- than a year. lected over multiple years. To maximize specificity, De Novo Variants Can Distinguish IBD Parasites DISCOVAR calls were produced independently from each of two lanes of sequencing data, allowing discordant geno- For subsequent analyses, we have worked under the assump- tion that the samples within clades have not undergone mei- types to be discarded and further reducing the absolute abun- otic outcrossing with other lineages from the same clade. The dance of false positive calls. A phylogeny was calculated using maximum parsimony based on the Manhattan distance of Thie `s region is characterized by low transmission and a pre- vious study of 974 samples from this region, covering the SNPs and INDELs combined (SNP/INDEL evolutionary rates same period of time, showed a low prevalence of complex are not known a priori; fig. 2 and supplementary fig. S2, Supplementary Material online). (multilineage) infections (Galinsky et al. 2015). Although we A Lento/bipartition plot (Lento et al. 1995)was generated cannot absolutely rule out recombination between IBD indi- viduals, complexinfectionsareaprerequisite forrecombina- for this phylogeny (supplementary fig. S3 and table ST2, Supplementary Material online); for each bipartition of the tion. Given the low population prevalence of parasites chromosome Mb Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE Th162.12 Th162.12 a) b) 0.91 Th230.12 Th230.12 0.47 Th196.12 Th196.12 Th132.11 Th132.11 0.36 Th074.13 Th074.13 Th106.09 Th106.09 0.84 Th117.11 Th117.11 0.7 Th134.11 Th134.11 Th106.11 Th106.11 Th086.07 Th086.07 Th211.13 Th211.13 0.25 Th246.13 Th246.13 0.43 Th092.13 Th092.13 0.77 Th166.12 Th166.12 Th245.13 Th245.13 Th061.13 Th061.13 0.99 Th095.13 Th095.13 100 (vars) Th068.12 Th068.12 FIG.2. Phylogenetic resolution of parasite samples using standing variation versus de novo variants. (a) Maximum parsimony tree of standing variation (24-SNP molecular barcode—Daniels et al. 2015). The last two numerals in sample names indicate the year of collection. Clades are resolved, but samples within clades are indistinguishable. (b) More than 3,000 de novo variants were called via DISCOVAR that segregated the individuals into three IBD clades. Samples within clades 24 (red), 26 (blue), and 29 (green) were separated by 622, 828, and 1858, variants, respectively, with the closest individuals (Th106.09/Th074.13) distinguished by 88 de novo variants. Phylogenetic trees were calculated using 15276 SNPs and 11420 INDELs using maximum parsimony. Numbers on nodes indicate bootstrap support. Our ability to discern phylogeny using only de novo variants was high with bootstrap values above 0.5 for all nodes subtending samples collected at >1 year apart. belonging to each clade, and the attendant low probability descended from a single common ancestor without recom- that asinglehostwould have been simultaneously infected bination, the clades were therefore considered identical by by two parasite lineages of the same clade, we conclude that descent (IBD). Within clades, pairwise distances ranged from outcrossing is unlikely to have occurred. Variants found within 88 to 677 variants, with both extremes being present in a clonal expansion are therefore likely to represent de novo clade 29. As might be expected, the oldest sample mutation. Of the 26,696 variants that distinguished the three (Th086.07, collected in 2007) exhibited a greater genetic clades from each other, 3,200 segregated within the clades distance with respect to the other samples in clade 29, (622, 828, and 1858 for clades 24, 26, and 29, respectively). though this pattern was not repeated in the other two Some loci segregated within more than one clade (9 loci in clades consisting of more closely contemporaneous sam- clades 24 and 26; 39 in 24 and 29; 57 in 26 and 29; 1 locus ples (2012–2013; fig. 2). segregated in all three clades) (supplementary fig. S2, Variants were found predominantly in intergenic sequence Supplementary Material online); though many of these loci for both SNPs (78.5%) and INDELs (83.7%); within coding showed different alleles in different clades, we are unable to sequence in-frame INDELs and missense SNPs generate a fur- distinguish between independent mutations versus shared ther 15% of the variants in each class (table 1) indicating that variation. As might be expected, variants in the extended ge- selection would not significantly confound phylogeny recon- nome showed a higher degree of miscalling, with each clonal struction. Perhaps surprisingly, whereas we would expect the lineage having a different degree of miscalling—indicating a majority of sequencing errors to be observed as singletons, haplotype specific effect (supplementary table ST1, the proportions of nonsense variants did not increase when Supplementary Material online). Genes that are frequently considering only singletons, indeed the proportion of inter- used for the study of parasite diversity were also examined; genic variants significantly increased when considering only no variants of any type were found within any of the clades for singletons. This could represent terminal-branch variants in MSP1, MSP2,SERA2, TRAP,or CSP consistent with the earlier the rapidly mutating intergenic regions or could imply that sequencing carried out on these samples (A. Bei—personal false positive variants deriving from alignment error (which communication). The uniformity of these markers within an would be less likely to be found as singletons) are more prev- otherwise panmictic population indicated that each clade alent in coding regions. 1681 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE Table 1. Mutation Consequences by Variant Class and Allele Frequency. Variant Consequence INDEL SNP Nonsingleton Singleton frameshift_variant 12 (1.0%) 0 (0.0%) 4 (0.2%) 8 (0.3%) inframe_deletion 87 (7.1%) 0 (0.0%) 27 (1.6%) 60 (2.4%) inframe_insertion 85 (6.9%) 0 (0.0%) 9 (0.5%) 76 (3.0%) intergenic 1025 (83.7%) 2331 (78.5%) 1415 (86.0%) 1941 (76.1%) intragenic_variant 0 (0.0%) 1 (0.0%) 0 (0.0%) 1 (0.0%) missense_variant 12 (1.0%) 445 (15.0%) 142 (8.6%) 315 (12.4%) non_coding_exon_variant 4 (0.3%) 10 (0.3%) 3 (0.2%) 11 (0.4%) stop_gained 0 (0.0%) 12 (0.4%) 3 (0.2%) 9 (0.4%) synonymous_variant 0 (0.0%) 171 (5.8%) 43 (2.6%) 128 (5.0%) Phylogeny Shows Distinct Substructure within a Single Clonal Expansion Th162.12 An “unbalanced” phylogeny can indicate the presence of Th196.12 super-spreader events or long infection chains (Colijn and Th132.11 Th074.13 Th230.12 Gardy 2014) both of which will generate more internal nodes Th106.09 that are connected to a single leaf descendent. Sackin and r = 0.653, p = 0.04 Colless indices were used to assess the phylogenies and both measures showed some elevation above expectations for a Th106.11 Th117.11 balanced tree, however, none of the clade phylogenies were Th134.11 found to be significantly unbalanced (Blum and Franc ¸ois Th086.07 2005), nor was the overall phylogeny of all three clades, and we cannot conclude that any of these phylogenies represents a super-spreader event or a single infection chain. Although all of the samples within a clade are descended from asinglecommonancestoroverthe time since thisre- cent common ancestor, we can expect some structure to have emerged within these clades. The parsimony tree (fig. 2b) for the largest clade 29 showed two distinct subclades, one consisting of three samples (subclade 29.1: Th117.11, 2000 2000 2005 2010 2015 Th134.11, Th106.11) and another larger clade with six samples Collection Date (subclade 29.2: Th162.12, Th230.12, Th196.12, Th132.11, FIG.3. Root-to-tip distance in the phylogeny of clade 29 correlates Th074.13, Th106.09), with further subdivisions in the second with sampling time. The observed correlation of genetic distance and clade. We conclude from this that the two clades represent time (via Pearson’s product moment) indicates that many of our distinct clonal expansions that have persisted separately variants are de novo and that mutation occurs at a sufficiently high within the population for at least 4 years, despite being trans- rate to resolve patterns of malaria transmission. Regression from mitted within the same geographical area (supplementary fig. sampling times indicates a common ancestor for the clade may S4, Supplementary Material online). have existed in approximately the year 2000. Clade 29 Is a Measurably Evolving Population other two clades, clade 29 was sampled over a far greater It is important to establish whether or not any of these clades period of time, consisting of samples from 2007 to 2013; represent a measurably evolving population (MEP): a popu- the MEP analysis showed an MRCA in the year 2000 which, lation in which a set of samples can be seen to have evolved although based on a small number of samples, would imply between different sampling times. In an MEP much or all of we have sampled the clade for a little over half of the time it the variation will have accrued de novo between the first and has been extant. Consequently this clade showed a much last sampling years rather than due to divergence from a stronger phylogenetic signal, with TreePuzzle analysis resolv- common ancestor before sampling began; as a result, the ing 71.4% of all quartets (supplementary fig. S5, root-to-tip distance of the resultant phylogeny should corre- Supplementary Material online). This clade is therefore suit- late with the sampling time, because samples collected later able for inference of evolutionary rates and can be used to have had more opportunity to diverge from the most recent reconstruct a transmission network. commonancestor(MRCA)(Drummond et al. 2003). The two smaller clades were not sampled over sufficient De Novo Mutations Can Reconstruct Transmission time for us to confirm that we had detected an MEP, how- ever, clade 29 showed significant correlation between root-to- Pathways Transmission networks were calculated using a minimum- tip distance and time (R¼ 0.65, P¼ 0.04) via TempEST anal- ysis (fig. 3) and was confirmed as being a MEP. Unlike the distance based method (Jombartetal. 2011) that relied on Root-tip Distance (variants) Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE Evolutionary Rates Indicate That De Novo Genomic Th074.13 Epidemiology Studies Are Viable in P. falciparum Evolutionary rates were calculated for all pairs within the [1.0] transmission network for clade 29 using the method of Li 0.03 et al. (1988). Highly variable evolutionary rates were seen be- Th162.12 [0.5] tween samples and a larger study would be necessary in order [1.0] 0.02 Th230.12 to derive areliableevolutionaryratefor P. falciparum. [1.0] Th086.07 Th106.09 121 [0.48] [1.0] 0.02 However, comparisons of the evolutionary rate within core 0.11 [ 0.02] Th132.11 and extended regions of the genome do show an increase in [0.41] 139 our ability to discern an MEP; the core genome acquires 0.84 0.06 Th196.12 [0.59] (61.8) mutations/month/genome while the extended regionsgaindenovovariantsatthree times thisrate2.08 [0.97] [0.03] 0.29 (61.3) muts/mth/gen (supplementary fig. S6a, [0.76] Th134.11 Supplementary Material online). The combined evolutionary 0.05 rate for all mutation classes is therefore 2.92 (62.3) muts/ [0.24] Th106.11 mth/gen (fig. 5 and supplementary fig. S6b, Supplementary [1.0] Material online). This figure is comparable to the high rate of [bootstrap] Th117.11 0.04 Distance reversion rate mutation acquisition in some RNA viruses (fig. 5 and supple- mentary fig. S7, Supplementary Material online) and indicates FIG.4. Transmission networks were estimated for clade 29 based on that direct transmissions typically acquire at least some de SNP and INDEL distances combined using a minimum distance tree novo mutations between host infections. approach. Edges are labeled with pairwise genetic distance, as well as bootstrap values (superscript in parentheses) and reversion rate (sub- Discussion script). Bootstrap values were calculated for all edges by sampling This study demonstrates for the first time the viability of using with replacement for 100 iterations and indicate strong support for de novo mutation to characterize transmission in a eukary- many of these distance-derived relationships. Potential alternative edges derived from the bootstrap results are shown in gray. Nodes otic pathogen and further shows how this technique could be in later years are shown in darker colors. Concordance between the used to study the transmission of malaria—a disease that, parsimony tree and distance based methods is notable; in both the despite a worldwide elimination campaign, still kills phylogeny and transmission network sample Th106.09 is basal to >400,000 people per year (WHO 2016). To access rapidly subclade 29.1 and Th106.11 to subclade 29.2. Reversion rates are sig- evolving low-complexity sequence, we have used long se- nificantly higher for the edge joining Th106.09 to Th106.11 support- quencing reads, PCR-free library preparation and a large k- ing the independence of subclade 29.2 mer during the assembly stage, opening up a further 14% of the P. falciparum genome to accurate genotyping. The in- Manhattan distances of both variant types combined creased coverage in turn enables more rapidly mutating var- (fig. 4). Bootstrap values were high for most relationships iants to be called than was previously possible, and the with a majority of secondary transmission paths weakly accompanying increased specificity prevents the de novo sig- supported. The distance-derived transmission path pro- nal from being lost amidst the noise of sequencing or align- vided additional support for the multiclade structure that ment error. Within a set of closely related samples from was seen in the maximum-parsimony tree results, indi- patients in Senegal (Daniels et al. 2015), we were able to cating a separate expansion of subclade 29.1 and 29.2. call 3,200 variants that successfully resolved ancestry among Both of these clades were found to be related to the basal these parasites and generated robust phylogenies for what sample Th086.07. appeared to be identical clones when assayed at lower genetic As a secondary confirmation of the network, we resolution. mapped individual variants to each vertex in the network Two of the clades we studied were sampled across only 2 and calculated the reversion rate across all primary edges years each and would not have the capacity to show the (i.e., the number of de novo mutations that would need correlation between sampling time and root-to-tip distance to be lost to support each step); we expect that variation that would suggest these mutations had arisen during the accrued on each lineage since divergence as well as gen- sampling period. On the other hand, the third clade persisted otyping errors will contribute to this statistic, however, for a total of 7 years and was determined to be a measurably these two types of “variants” would be indistinguishable. evolving population. This level of resolution is a first for a The majority of edges showed reversion rates of 11% or eukaryotic pathogen over time scales relevant to disease less, however, the edge connecting subclade 29.2 to the transmission (Biek et al. 2015) and supports the use of de rest of the samples (Th106.09 through Th106.11) showed novo variants to examine the phylodynamics of malaria in the a markedly higher reversion rate (0.29). This discrepancy future. indicates a lack of sampling depth in the early years of the To demonstrate the utility of this approach, a transmission expansion and shows that we have not captured a true network was inferred for the largest clade using a minimum- progenitor of subclade 29.2. distance approach (fig. 4; Jombartetal. 2011). For 1683 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE 1.4 wks 1.5 wks 1.6 wks 1.9 wks RNA virus Bacteria Eukaryote 2.4 wks core extended 3.9 wks 10 1.3 mths 3.3 yrs FIG.5. Lower mutation rates and the restricted genome size of the core genome would generate a single new mutation on average every month. The increase in accessibility offered by 250-bp reads and DISCOVAR increases both genome size and measurable evolutionary rate, resulting in a new variant every 2.4 weeks in the low complexity genome and every 1.5 weeks overall. This makes P. falciparum comparable to other infectious agents like influenza or HIV where transmission networks may be informed by genome sequencing. Mutation rates derived from Biek et al. (2015). characterizing malaria transmission, de novo variation repre- (Th132.11, Th196.12, and Th230.12). It is important to note sents a significant advance over standing variation, which that this conclusion would have been impossible to draw would be unable to identify directional relationships between from a combination of standing variation and epidemiolog- infections. ical data: although all six of these samples were found within Evolutionary ratesascalculatedacrossthe third-clade the Thie `s region, neither clade separates geographically and transmission network imply one new mutation is acquired individual members from separate clades were found in close every 1.5 weeks (fig. 5), meaning that at least one new muta- proximity (supplementary fig. S4, Supplementary Material tion should be found between direct transmissions. Although online). Perhaps more remarkably, despite 70% of the samples an- detecting a single mutation between individuals without any genotyping errors in a 23 Mb genome might be challenging alyzed here having been taken in the intervening years, sam- with current sequencing technologies, this does indicate that ple Th074.13 derives from Th106.09, and is entirely sufficient de novo mutations to reconstruct transmission independent of the other two clades. This means that at pathways can be expected in genomic epidemiology studies the beginning of the dry season in 2011 (when transmission would be expected to tail off) at least three separate clades of that were conducted over 2 years ormore—timesscalesthat are highly relevant to malaria transmission. this particular strain were circulating in the Thie `s region and Although the transmission network was generated as a all survived to subsequent wet seasons. Maintenance of these “proof-of-principle”, low rates of reversion indicate that, parasite types suggests a sizable asymptomatic reservoir of even if the number of intervening generations remains unre- infection may have been present. solved, many of the relationships identified are accurate and Intriguingly, the smallest of those clades consists of just one sample Th074.13; this sample came from a village 16 km out- some firm conclusions can be drawn from the network about the nature of transmission within Senegal. side of Thie `s and is separated from its closest node in the We can eliminate the possibility that these clonal parasites network by 6 years (supplemenatary fig. S4, Supplementary persist through a single chain of infections: an infection chain Material online), yet it is separated from this parasite by just would generate a tree with each node joined only to its de- 88 variants—the shortest genetic distance in our data set scendent (Colijn and Gardy 2014), yet both the tree and found over the longest intervening time and distance. As a transmission network show strong internal structure with result of the large population sizes that P. falciparum reaches well supported distinct clades; in particular, the clade of three during the erythrocytic life stages we do not expect de novo samples in 2011 (Th106.11, Th117.11, and Th134.11) is not mutations to be commonly fixed within the host, but instead directly related to the other clade found from 2011 to 2012 expect that differences in the consensus sequences of related EBOV HIV−M HFLUV H. pylori S. aureus M. tuberculosis P. falciparum evol. rate (muts / genome / year) Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE infections will only be fixed by passage through population In summary,wehaveshown aeukaryoticparasitetobe a bottlenecks during mosquito transmission. The lower evolu- measurably evolving population across time scales that are tionary rate found between these two samples could be a relevant to disease transmission, and we have calculated evo- result of long-term asymptomatic infections during which no lutionary rates that indicate de novo mutations could be new variants would expect to be fixed. Further investigations found even between direct infections. This type of genetic of the relationship between evolutionary rate and serial in- signal is qualitatively different to those that have been avail- terval would be of great interest and may ultimately allow us able in previous genomic epidemiology studies in malaria and to use genetic distance to infer the duration of asymptomatic will enable finer grained characterization of disease transmis- infections and their contribution to local disease incidence. sion. Though this data set is relatively small, we are limited by Although calculations of TMRCA are difficult from such a the requirement to use only clonal outbreaks for the network small sample set, the placement of the basal node in the year inference and challenges remain before these techniques can 2000, the diversity contained within the phylogeny, and the be used in larger data sets from higher-transmission settings, number of persisting clades all suggest that this clonal lineage where sexual recombination will add complexity to the re- has been long-lasting and did not emerge with sample construction. We hope that future data sets revealing de Th086.07 in 2007. novo mutations in larger collections of patient samples will Even with this limited data set we have successfully derived motivate the development of more sophisticated methods conclusions about the patterns of disease transmission in this for transmission-network reconstruction; methods that could region, but the capacity to easily derive transmission networks work on recombined samples or could tolerate a lower degree could be transformative for larger data sets, particularly in of genotyping specificity. Nevertheless, there are many regions preelimination settings. Reiner et al. (2015) have demon- in which the techniques employed here are of immediate use. strated the capacity for network reconstruction to enable As of the end of 2015, the WHO has certified 27 countries as assessments of “malariogenic potential” (a combined mea- malaria-free, three more countries are in the process of cer- sure of disease importation rates and capacity for local trans- tification (having had 3 years without indigenous transmis- mission) in Swaziland, a country nearing elimination. The sion), and a further ten countries have had at least 1 year network generated by Reiner et al. is derived from probabil- istic modeling of transmission based upon the times and without indigenous transmission (WHO 2016). In any of these geographic locations of the infections and would enable pub- settings there could be outbreaks that are partially or entirely lic health authorities to target interventions such as focal clonal. Reconstruction of a transmission network in the man- mass drug administration (FMDA) where they are most nerthatwehaveachievedhere would be able to detect any needed. It is worth noting that our network inference indi- links that were entirely within-country, thus distinguishing cates that some samples will confound expectations of spa- imported from indigenous transmission. These data could tiotemporal proximity, and validation with genetic data also identify “super-spreaders” that might be targeted with wouldbeavaluableadditiontosuchmapping attempts. focal interventions. The capacity to distinguish between in- In other pathogens network reconstruction using de novo digenous and imported transmission or to pinpoint hotspots variation has already led to actionable results for infectious- in countries nearing elimination could be key capacities for disease control and these situations could apply for malaria. malaria elimination efforts. There are now numerous cases in which whole-genome se- quencing and network reconstruction have been deployed in Materials and Methods real time to identify the source of a hospital-acquired infec- tion leading directly to the identification of reservoirs of per- Lab Strain Sample Preparation sistent disease transmission (Halachev et al. 2014; Quick et al. DISCOVAR libraries were prepared for two Dd2 lines (Dd2- 2015). Even on a wider geographic scale, where comprehen- 2D4 and Dd2-B2) and one 3D7 line (3D7-A-10). The Dd2-2D4 sive sampling of an outbreak cannot be guaranteed, the ap- line is derived from clone MRA-156 (originally deposited by proach demonstrated here can generate concrete Thomas E. Wellems and available from BEI resources, NIAID, recommendations for public health authorities. NIH https://www.beiresources.org/Catalog/ Transmission-network reconstructions of a tuberculosis out- BEIParasiticProtozoa/MRA-156.aspx) whereas clones Dd2-B2 break in Canada were able to show that later cases were and 3D7-A10 were derived from the MALDA/MDTIP consor- progressions from latent to active disease, rather than active tium (Cowell et al. 2018) and are available on request. transmission, with the direct result that the outbreak was Although the Dd2 lines derive from a common subcloned declared over, a conclusion that was reached solely by com- patient isolate, small numbers of de novo variants are parison of the rate of de novo mutation between samples expected to have been acquired via mutation and genetic (Hatherell et al. 2016). Adapting such a capacity to malaria drift in the years since the laboratory lines diverged. Three would have major implications for elimination efforts: malaria biological replicates (using separate isolates of DNA) were elimination within a country is defined as zero incidence of prepared for the Dd2-B2 and 3D7-A10 samples, and one li- indigenous cases (WHO 2016) and the ability to distinguish a brary of the Dd2-2D4. All libraries were sequenced in dupli- chain of transmission from repeated importations without cate sequencing lanes allowing us to remove individual recourse to a multicountry data set could be a key capacity for the WHO Global Malaria Program. miscalled bases where they were not consistent across lanes. 1685 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE Senegal Sample Preparation from the original 250 bp reads. This was followed by quality Samples used in this study derive from prior collections score recalibration using GATK VQSR (DePristo et al. 2011)— (Daniels et al. 2013, 2015)inThie `s, Senegal. All samples filtering by VQSLOD score with a 90% retention threshold. were collected from individuals after informed consent of The VQSR truth-set consisted of variant loci that had been either the subject or a parent/guardian. This protocol was previously determined in the “Pf-Crosses” data set generated reviewed and approved by the ethical committees of the by Miles et al. (2016).Thisdataset consists of aseriesof Senegal Ministry of Health (Senegal) and the Harvard T. H. genetic crosses of P. falciparum clones (3D7  HB3, HB3 Chan School of Public Health (16330, 2008) as detailed in Dd2, and 7G8  GB4) in which parents and all offspring were Daniels et al. (2015). Samples were collected passively from deeply sequenced, enabling variant calls to be filtered accord- patients reporting to the “Service de Lutte Anti-Parasitaire” ing to conformity with Mendelian inheritance expectations. (SLAP) clinic for suspected uncomplicated malaria between approximately August and December each year. Patients with Variant Validation acute fevers or history of fever within the past 24 h of visiting The P. falciparum genome presents a challenge to many ge- the clinic and with no reported history of antimalarial use notype callers due to extensive repeat structure and recom- were considered; they were diagnosed with malaria based on bination within some large gene families (Bopp et al. 2013), as microscopic examination of thick and thin blood smears and well as the high A/T nucleotide composition of the intergenic rapid diagnostic tests (RDT), as available. All samples were regions (Gardner et al. 2002). Moreover, validation of variant collected via venous blood draw and stored as glycerolyte calls in low-complexity regions of the P. falciparum genome stocks, and an aliquot of material collected on Whatman filter via conventional PCR and Sanger dideoxy sequencing is dif- paper was extracted for nucleic acid material and genotyped ficult, as PCR amplicons evaluated were generally too repet- via 24-SNP barcode as described in Daniels et al. (2013).From itivetoyield interpretablesequencetraces. As aresult, atwo- this larger data set, 18 samples were chosen representing stage concordance-based approach to genotyping was taken: single infections from clades 24, 26, and 29, (3, 5, and 10 1) the “accessible” genome was determined separately for samples, respectively) based on identical genotypes observed each genotype caller; 2) genotypesinthe regionsaccessible using the 24-SNP barcode. The 18 parasite samples were to both callers were compared with a known sample and culture-adapted to produce parasite DNA free of host (hu- assessed for specificity. man) contamination. Culture times were kept to a minimum Accessible Genome Determination. Miles et al. used the Pf- (<15 cycles) to preserve the integrity of the original patient Crosses data set in order to characterize the “accessible material, however, the acquisition of confounding de novo genome”—the region in which genotyping calls could be re- mutations within this culturing stage cannot be ruled out. liably made (Miles et al. 2016). The authors found 90% of the genome to be within an acceptable error rate, and only hy- Sequencing and Variant Discovery pervariable regions, centromeres and telomeres were ex- DNA extracted from all samples was size selected to give a cluded. DISCOVAR generates few variables upon which mean fragment length of 450 bp and sequenced using an variant recalibration could be performed; therefore, we took Illumina HiSeq 2500 with a read length of 250 bp, ensuring a stricter approach. Using in silico sequence generated from a an overlap of approximately 50 bp. Sequence was aligned to second genome assembly, we determined a region to be ac- the Pf3D7_v3 reference using BWA-mem v. 0.7.12 (Li 2014). cessible only if the variant caller can successfully reconstruct All alignments were deposited at the NCBI Short Read the region around all variants within it. Three variant calling Archive (SRA) database (accession: SRP137271). approaches were assessed: DISCOVAR using 250 bp reads, Variant discovery using DISCOVAR (release no. r52488) HaplotypeCaller (DePristo et al. 2011) with 250 bp reads (Weisenfeld et al. 2014) was performed with 250 bp reads (HC250) and HaplotypeCaller with 100 bp reads (HC100). on 450 bp fragments. The Pf3D7 genome was subdivided An artificial (in silico) Dd2-2D4 read set (250 bp fragment into 30 kb regionsand DISCOVAR wasrun separately foreach size, 450 bp insert size) was generated from a PacBio assembly region. In regions where it was not possible to construct a of the Dd2-2D4 line of P. falciparum (ftp://ftp.sanger.ac.uk/ single assembly graph for variant calling, DISCOVAR was pub/project/pathogens/Plasmodium/falciparum/DD2/) us- rerun with progressively smaller regions (10, 5, and 2 kb) of ing wgsim v0.3.2 (Li et al. 2009) using default parameters the genome. In all cases a 1 kb overlap was allowed to control and generating 100X coverage. This read set was aligned to for edge effects, though variants were only included up to the the standard Pf3D7_v3 reference assembly and genotypes inner edge of this flanking region. All libraries were sequenced were called using DISCOVAR and HaplotypeCaller as detailed and genotyped in duplicate to remove individual miscalled above. For each resulting variant, the putative Dd2-2D4 se- bases derived from sequencing error. Resulting genotypes quence in a 200 bp region around the variant was recon- were filtered based on Phred score (i.e., no call above 23 or structed and realigned to the Dd2-2D4 PacBio reference via any below ten included) and hypervariable sites (those for Smith–Waterman alignment (BWA-SW [Li and Durbin which the genotyping software generated >50 potential al- ternative variants) were removed. 2010]). Any differences between this locally reconstructed Variant discovery using GATK HaplotypeCaller was per- Dd2-2D4 sequence and the Dd2-2D4 PacBio assembly (i.e., formed with both 250 bp reads and a more typical data set the Levenshtein distance between the two strings) would (100 bp reads with a 300 bp insert size) that was generated therefore represent errors derived either from misalignment 1686 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE of the simulated reads or from the genotyping algorithm Temporal data were assessed by examining the correlation itself. This Smith–Waterman approach was used to compare between sampling time and the root-to-tip distance of a calls made using DISCOVAR using 250 bp reads, “non-clock” phylogenetic tree (in this case one derived by HaplotypeCaller using the same 250 bp reads, and via neighbor-joining) using TempEST v1.5 (Rambautetal. HaplotypeCaller using 100 bp reads. For each variant calling 2016). Correlation between genetic distance and time was method, the “accessible genome” was defined as those assessed via Pearson’s product moment. Transmission net- regions where alignment depth was within one standard de- works were reconstructed using SeqTrack within adegenet viation of the mean and where the Levenshtein distance (LD) (Jombartetal. 2011) based on the Manhattan distance of was zero (i.e., where reconstruction of Dd2-2D4 had been all variants: an approach that has been shown to outperform error free). Low complexity regions of the genome were parsimony based methods where sampling density is low assessed by running Dustmasker (version 1.0 in package (Worby et al. 2017). Bootstrapping of the transmission net- BLASTþ 2.2.30 using default parameters). work was performed by sampling all variants with replace- ment for 100 iterations; bootstrap values are reported as the proportion of bootstrapped networks supporting each join. Accessible Marker Validation An experimentally validated set of genotypes was then used Evolutionary Rate Calculation for confirmation of variants within those regions of the ge- For samples sharing very recent common ancestry, pairwise nome that were accessible to both HaplotypeCaller and variant call differences are expected to be dominated by se- DISCOVAR. This biological validation was performed using quencing error, leading to overestimation of the rate at which three laboratory lines of P. falciparum,eachsourced from de novo mutations accumulate. Where a transmission net- different laboratories and available on request. The called work could be resolved for a clade, evolutionary rates were variants were then compared with the set of previously val- therefore calculated for each IBD pair as the difference be- idated “Pf-Crosses” variants using VCFEval (Cleary et al. 2015): tween the distance of each sample to an outgroup (chosen receiver-operator curves (ROC curves—i.e., plots of true pos- from one of the other two clades) to account for ancestral itive and false positive rates at equal quality scores) and spe- diversity (Li et al. 1988). This was performed separately for the cificity/sensitivity values were calculated in the regions that regions of the genome determined to be callable with had been previously determined to be accessible by both DISCOVAR, HC250, and HC100 and the DISCOVAR-only ge- genotype callers and in the regions determined to be acces- nome yielding an estimate of the contribution to the evolu- sible only to DISCOVAR. Pf-Crosses data is available from: tionary rate of “core” and “extended” regions of the genome. https://www.malariagen.net/data/pf-crosses-1.0 The mean evolutionary rate is reported for each pair of sam- ples using all potential outgroups. Phylogeny and Transmission Networks SNP and INDEL distance matrices were calculated based on Supplementary Material the number of mutations between each sample pair, without Supplementary data are available at Molecular Biology and making adjustments based on putative mutation rate or STR length (all INDELs were treated as a single mutation regardless Evolution online. of length or whether they were found within a known STR). Acknowledgments Parsimony trees were generated using the R package We are grateful for the assistance of Thomas D. Otto (Sanger Phangorn (Schliep 2011) and neighbor-joining trees were gen- Institute) for providing the Dd2_v1 reference genome for erated using the R package “ape” (Paradis et al. 2004). Phylogenies were calculated by both methods for the pairwise validation. BEI resources, David Fidock (Columbia SNP distance, INDEL distance and the Manhattan distance of University), and Daniel Goldberg (Washington University in St. Louis) for supplying their Plasmodium stocks for sequenc- both SNPs and INDELs. Bootstrapping was performed by ing. We thank the UCAD/LeDantec laboratory and SLAP “sampling with replacement” for 100 replicates across each phylogeny. Bootstrap support is reported as the proportion of clinic team, particularly Dr. Ngayo Sy, and the patients for trees supporting each node. clinical samples. We thank Sin ead Chapman, Caroline Cusick, To assess the phylogenetic signal within the clades, the and Jim Bochicchio for project management, and Steve Schaffner and Angela Early for their insightful comments on Tree-Puzzle program (Schmidtetal. 2002) was used to ana- lyze all potential sequence quartets within the set. For any set the manuscript. Funding for this work at the Harvard T.H. of four individuals, a maximum of three phylogenies are pos- Chan School of Public Health and the Broad Institute was sible, and the TreePuzzle method randomly selects individuals provided by a grant to D.F. Wirth from the Global Health from the data set and assesses by maximum likelihood if it Program at the Bill and Melinda Gates Foundation accords with any of the three phylogenies (see Strimmer et al. (OPP1053604, “Genomic-Based Diagnostics for Elimination for details [Strimmer and von Haeseler 1996]). A quartet with and Eradication of Plasmodium”); publication was paid for clean phylogenetic signal will show high likelihood to one of via the Chronos system. Senegal sample collection was the tree topologies, a network with significant introgression funded by the NIAID (R01AI099105: “Genetic Variation and will show likelihood to more than one, and an unresolved Evolution of Artemisinin Resistance.”). This work was carried star-like phylogeny will show likelihood to none. out with support from the ExxonMobil Foundation. 1687 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE elucidates Ebola virus origin and transmission during the 2014 out- References break. Science 345(6202): 1369–1372. Azarian T, Maraqa NF, Cook RL, Johnson JA, Bailey C, Wheeler S, Nolan Halachev MR, Chan JZ-M, Constantinidou CI, Cumley N, Bradley C, D, Rathore MH, Morris JG, Salemi M. 2016. Genomic epidemiology Smith-Banks M, Oppenheim B, Pallen MJ. 2014. Genomic epidemi- of methicillin-resistant Staphylococcus aureus in a neonatal intensive ology of a protracted hospital outbreak caused by multidrug- care unit. PLoS One 11:e0164397. resistant Acinetobacter baumanniiin Birmingham, England. Biek R, Pybus OG, Lloyd-Smith JO, Didelot X. 2015. Measurably evolving Genome Med.6(11):70. pathogens in the genomic era. Trends Ecol Evol. 30(6): 306–313. Harris SR, Feil EJ, Holden MTG, Quail MA, Nickerson EK, Chantratita N, Blum MGB, Franc ¸ois O. 2005. On statistical tests of phylogenetic tree Gardete S, Tavares A, Day N, Lindsay JA, et al. 2010. Evolution of imbalance: the Sackin and other indices revisited. Math Biosci. MRSA during hospital transmission and intercontinental spread. 195(2): 141–153. Science 327(5964): 469–474. Bopp SER, Manary MJ, Bright AT, Johnston GL, Dharia NV, Luna FL, Hatherell H-A, Didelot X, Pollock SL, Tang P, Crisan A, Johnston JC, Colijn McCormack S, Plouffe D, McNamara CW, Walker JR, et al. 2013. C, Gardy J. 2016. Declaring a tuberculosis outbreak over with Mitotic evolution of Plasmodium falciparum shows a stable core genomic epidemiology. Microb Genomics 2(5):e000060. genome but recombination in antigen families. PLoS Genet. 9(2): Jombart T, Eggo RM, Dodd PJ, Balloux F. 2011. Reconstructing disease e1003293. outbreaks from genetic data: a graph approach. Heredity 106(2): Chenet SM, Taylor JE, Blair S, Zuluaga L, Escalante AA. 2015. Longitudinal 383–390. analysis of Plasmodium falciparum genetic variation in Turbo, Lemieux J. 2015. Genomic analysis of evolution in Plasmodium falcipa- Colombia: implications for malaria control and elimination. Malar rum and Babesia microti. Doctoral Thesis: http://nrs.harvard.edu/ J. 14:363. urn-3:HUL.InstRepos:15821587. Cleary JG, Braithwaite R, Gaastra K, Hilbush BS, Inglis S, Irvine SA, Jackson Lento GM, Hickson RE, Chambers GK, Penny D. 1995. Use of spectral A, Littin R, Rathod M, Ware D, et al. 2015. Comparing variant call files analysis to test hypotheses on the origin of pinnipeds. MolBiolEvol. for performance benchmarking of next-generation sequencing var- 12(1): 28–52. iant calling pipelines. bioRxiv. doi: https://doi.org/10.1101/023754 Lenz C, Haerty W, Golding GB. 2014. Increased substitution rates sur- Colijn C, Gardy J. 2014. Phylogenetic tree shapes resolve disease trans- rounding low-complexity regions within primate proteins. Genome mission patterns. Evol Med Public Health 2014(1): 96–108. Biol Evol. 6(3): 655–665. Cowell AN, Istvan ES, Lukens AK, Gomez-Lorenzo MG, Vanaerschot M, Li H. 2014. Toward better understanding of artifacts in variant call- Sakata-Kato T, Flannery EL, Magistrado P, Owen E, Abraham M, et al. ing from high-coverage samples. Bioinformatics 30(20): 2018. Mapping the malaria parasite druggable genome by using 2843–2851. in vitro evolution and chemogenomics. Science 359(6372): 191–199. Li H, Durbin R. 2010. Fast and accurate long-read alignment with Daniels R, Chang H-H, Sene PD, Park DC, Neafsey DE, Schaffner SF, Burrows-Wheeler transform. Bioinformatics 26(5): 589–595. Hamilton EJ, Lukens AK, Van Tyne D, Mboup S, et al. 2013. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Genetic surveillance detects both clonal and epidemic transmission Abecasis G, Durbin R. 2009. The sequence alignment/map format of malaria following enhanced intervention in Senegal. PLoS One and SAMtools. Bioinformatics 25(16): 2078–2079. 8:e60780. Li W-H, Tanimura M, Sharp2 PM. 1988. Rates and Dates of Daniels RF, Schaffner SF, Wenger EA, Proctor JL, Chang H-H, Wong W, Divergence between AIDS Virus Nucleotide Sequences’. Mol Biol Baro N, Ndiaye D, Fall FB, Ndiop M, et al. 2015. Modeling malaria Evol. 5:313–330. genomics reveals transmission decline and rebound in Senegal. Proc Metsky HC, Matranga CB, Wohl S, Schaffner SF, Freije CA, Winnicki Natl Acad Sci U S A. 112(22): 7067–7072. SM, West K, Qu J, Baniecki ML, Gladden-Young A, et al. 2017. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Zika virus evolution and spread in the Americas. Nature Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A 546(7658): 411–415. framework for variation discovery and genotyping using next- Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, Gould K, generation DNA sequencing data. Nat Genet. 43(5): 491–498. Mead D, Drury E, O’Brien J, et al. 2016. Indels, structural variation, Drummond AJ, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG. 2003. and recombination drive genomic diversity in Plasmodium falcipa- Measurably evolving populations. Trends Ecol Evol. 18(9): 481–488. rum. Genome Res. 26(9): 1288–1299. Eyre DW, Cule ML, Wilson DJ, Griffiths D, Vaughan A, O’Connor L, Ip Morgulis A, Gertz EM, Sch€ affer AA, Agarwala R. 2006. A fast and sym- CLC, Golubchik T, Batty EM, Finney JM, et al. 2013. Diverse sources of metric DUST implementation to mask low-complexity DNA C. difficile infection identified on whole-genome sequencing. NEnglJ sequences. JComputBiol. 13(5): 1028–1040. Med. 369(13): 1195–1205. Narzisi G, Schatz MC. 2015. The challenge of small-scale repeats for indel FariaNR, QuickJ,Morales I, ThezeJ,deJesus JG,GiovanettiM,Kraemer discovery. Front Bioeng Biotechnol.3:8. MUG, Hill SC, Black A, da Costa AC, et al. 2017. Epidemic establish- Neher RA, Bedford T. 2015. nextflu: real-time tracking of seasonal ment and cryptic transmission of Zika virus in Brazil and the influenza virus evolution in humans. Bioinformatics 31(21): Americas. Nature 546(7658): 406–410. 3546–3548. Flueck C, Bartfai R, Volz J, Niederwieser I, Salcedo-Amaya AM, Alako BTF, Nkhoma SC, Nair S, Al-Saai S, Ashley E, McGready R, Phyo AP, Nosten F, Ehlgen F, Ralph SA, Cowman AF, Bozdech Z, et al. 2009. Plasmodium Anderson TJC. 2013. Population genetic correlates of declining trans- falciparum heterochromatin protein 1 marks genomic loci linked to mission in a human pathogen. Mol. Ecol. 22(2): 273–285. phenotypic variation of exported virulence factors. PLoS Pathog. 5(9): Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and e1000569. evolution in R language. Bioinformatics 20: 289–290. Galinsky K, Valim C, Salmier A, de Thoisy B, Musset L, Legrand E, Faust A, Park DJ, Dudas G, Wohl S, Goba A, Whitmer SLM, Andersen KG, Sealfon Baniecki ML, Ndiaye D, Daniels RF, et al. 2015. COIL: a methodology RS,Ladner JT, Kugelman JR,MatrangaCB, et al.2015. Ebola virus for evaluating malarial complexity of infection using likelihood from epidemiology, transmission, and evolution during seven months in single nucleotide polymorphism data. Malar J. 14:4. Sierra Leone. Cell 161(7): 1516–1526. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton Popovich KJ, Snitkin E, Green SJ, Aroutcheva A, Hayden MK, Hota B, JM, Pain A, Nelson KE, Bowman S, et al. 2002. Genome sequence of Weinstein RA. 2016. Genomic epidemiology of USA300 methicillin- the human malaria parasite Plasmodium falciparum. Nature resistant Staphylococcus aureus in an urban community. Clin Infect 419(6906): 498–511. Dis. 62(1): 37–44. Gire SK,Goba A, Andersen KG,Sealfon RSG, Park DJ,KannehL,JallohS, Quick J, Ashton P, Calus S, Chatt C, Gossain S, Hawker J, Nair S, Neal K, Momoh M, Fullah M, Dudas G, et al. 2014. Genomic surveillance NyeK,PetersT,etal. 2015. Rapid draft sequencing and real-time 1688 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE nanopore sequencing in a hospital outbreak of Salmonella. Genome Strimmer K, von Haeseler a. 1996. Quartet puzzling: a quartet maximum- Biol. 16(1): 114. likelihood method for reconstructing tree topologies. Mol Biol Evol. Rambaut A, Lam TT, Max Carvalho L, Pybus OG. 2016. Exploring the 13(7): 964–969. temporal structure of heterochronous sequences using TempEst Weisenfeld NI, Yin S, Sharpe T, Lau B, Hegarty R, Holmes L, Sogoloff B, (formerly Path-O-Gen). Virus Evol. 2(1): vew007. Tabbaa D, Williams L, Russ C, et al. 2014. Comprehensive variation Reiner RC, Le Menach A, Kunene S, Ntshalintshali N, Hsiang MS, discovery in single human genomes. Nat Genet. 46(12): 1350–1355. Perkins TA, Greenhouse B, Tatem AJ, Cohen JM, Smith DL. 2015. WHO. 2016. WHO j World Malaria Report 2016. Mapping residual transmission for malaria elimination. Elife Worby CJ, Lipsitch M, Hanage WP. 2017. Shared genomic variants: iden- 4:e09520. tification of transmission routes using pathogen deep sequence data. Schliep KP. 2011. phangorn: phylogenetic analysis in R. Bioinformatics Am.J.Epidemiol. 186(10): 1209–1216. 27(4): 592–593. Zilversmit MM, Volkman SK, Depristo MA, Wirth DF, Awadalla P, Hartl Schmidt H. a, Strimmer K, Vingron M, von Haeseler A. 2002. TREE- DL. 2010. Low-complexity regions in Plasmodium falciparum: miss- PUZZLE: maximum likelihood phylogenetic analysis using quartets ing links in the evolution of an extreme genome. MolBiolEvol.27(9): and parallel computing. Bioinformatics 18(3): 502–504. 2198–2209. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Molecular Biology and Evolution Oxford University Press

Loading next page...
 
/lp/ou_press/de-novo-mutations-resolve-disease-transmission-pathways-in-clonal-hj4pCf8j2T

References (51)

Publisher
Oxford University Press
Copyright
Copyright © 2022 Society for Molecular Biology and Evolution
ISSN
0737-4038
eISSN
1537-1719
DOI
10.1093/molbev/msy059
Publisher site
See Article on Publisher Site

Abstract

Detecting de novo mutations in viral and bacterial pathogens enables researchers to reconstruct detailed networks of disease transmission and is a key technique in genomic epidemiology. However, these techniques have not yet been applied to the malaria parasite, Plasmodium falciparum, in which a larger genome, slower generation times, and a complex life cycle make them difficult to implement. Here, we demonstrate the viability of de novo mutation studies in P. falciparum for the first time. Using a combination of sequencing, library preparation, and genotyping methods that have been optimized for accuracy in low-complexity genomic regions, we have detected de novo mutations that distinguish nominally identical parasites from clonal lineages. Despite its slower evolutionary rate compared with bacterial or viral species, de novo mutation can be detected in P. falciparum across timescales of just 1–2 years and evolutionary rates in low-complexity regions of the genome can be up to twice that detected in the rest of the genome. The increased mutation rate allows the identification of separate clade expansions that cannot be found using previous genomic epidemiology approaches and could be a crucial tool for mapping residual transmission patterns in disease elimination campaigns and reintroduction scenarios. Key words: genomic epidemiology, de novo mutation, malaria, transmission networks. Despite the relevance of all these scenarios to malaria Introduction transmission and the potential impact on elimination efforts, The reconstruction of transmission networks by sequencing inference of transmission networks based on de novo varia- pathogen samples is a powerful technique that has become tion has not been attempted. Genomics-based transmission increasingly routine in the field of genomic epidemiology. By analyses in malaria has have thus far relied on standing var- identifying infections with shared de novo variation or the iation (Nkhoma et al. 2013; Daniels et al. 2015). However, shortest genetic distances between samples, epidemiologists these approaches require the reassortment of common alleles are able to reconstruct detailed transmission networks in viral via sexual outcrossing and will not show discriminatory and bacterial pathogens such as flu (Neher and Bedford 2015), power in regions where transmission rates are too low (in MRSA (Harris et al. 2010; Popovich et al. 2016)orEbola (Gire which the parasite predominantly mates with itself) or in et al. 2014; Park et al. 2015). Assaying de novo mutations has situations where malaria has been reintroduced to a region proven to be a useful strategy in settings where epidemiolog- (resulting in an entirely clonal outbreak). Indeed, even within ical data is limited or in situations where the actual transmis- regions with ongoing outcrossing, clonal outbreaks persisting sion pathways are difficult to identify, and it is applicable to across multiple years have been found (Daniels et al. 2015); as any measurably evolving population (as defined by Drummond disease-endemic countries approach elimination these situa- et al. 2003). Using de novo genetic variation to track pathogen tions are likely to become more common. population movement has enabled the identification of dis- Several challenges prevent the direct application of trans- ease reservoirs underlying reemergence (Eyre et al. 2013); mission network reconstruction methodologies in assessed the relative contribution of local versus imported Plasmodium. In addition to complications introduced by sex- infections in a given setting (Azarian et al. 2016); and allowed ual outcrossing, the larger genome of P. falciparum and a slow the identification of the rate and times of transmission across mutation rate relative to prokaryotic or viral pathogens national borders (Faria et al. 2017; Metsky et al. 2017). reduces our ability to accurately resolve de novo mutations The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/ licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is Open Access properly cited. 1678 Mol. Biol. Evol. 35(7):1678–1689 doi:10.1093/molbev/msy059 Advance Access publication May 1, 2018 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE between individuals. The most rapidly mutating loci are clones, each with an associated genome assembly. These found within low complexity sequence, and commonly techniques were then applied to patient samples collected within short tandem repeats (STRs). Mutation rates for from a single clinic in Thie `s, Senegal representing three sep- STRs can be as high as 3.77 10 mutations/locus/genera- arate clonal expansions circulating in the region. These clonal tion (Chenet et al. 2015), many orders of magnitude higher samples were genotyped using a 24 SNP barcode (Daniels than the estimated rate for SNPs in non-STR sequence (1.07 et al. 2015) and were previously determined to be identical 10 substitutions/locus/generation [Lemieux 2015]). STR by fixed locus genotyping of commonly genotyped polymor- mutations are themselves associated with increased mutation phic genes (MSP1/MSP2/TRAP/CSP;A. Bei, personal rates in other organisms (Lenz et al. 2014) and low-complexity communication). sequences in P. falciparum show greatly increased evolution- 250-bp Illumina Reads Increase Genotype Specificity ary rates compared with high-complexity flanking sequences in Low-Complexity Regions (Zilversmit et al. 2010). These highly mutable, low complexity regions within the P. Genome Accessibility falciparum genome are difficult to access using conventional The effect of read length and calling algorithm on genome short-read sequencing approaches. Commonly used sequenc- accessibility was assessed by comparison of DISCOVAR ing formats (100 bp paired-end reads within 300 bp frag- (which is restricted to 250 bp reads) and GATK ments) prohibit high quality alignments in repetitive HaplotypeCaller with both 100 bp (hereafter HC100) and sequence, particularly when the size of the repeat region 250 bp reads (HC250). To assess genome accessibility we gen- approaches or exceeds the read length. Where insertion/de- erated an in silico library from one genome assembly (Dd2) letion (INDEL) mutations are found within STRs they can be aligned this to another referencegenomeassembly (3D7)and impossible to detect, as insufficient nonrepetitive flanking assessed our ability to accurately reconstruct the second ref- sequence around the variant can lead to incorrect partial erence genome. A greater proportion of the P. falciparum alignments or a complete failure to map reads (Narzisi and genome was found to be “accessible” to DISCOVAR Schatz 2015). (78.4%) and HC250 (79.3%), than was accessible to HC100 By spanning long repeats, longer read lengths will greatly (64%), indicating the importance of longer reads. This differ- improve our ability to analyze variation in low complexity ence in accessibility represents >3.5 Mb: the majority of the genomic regions (including SNPs, STR modifications and genome outside of the mitotically recombining telomere and non-STR associated INDELs) and may, along with associated subtelomere sequence and intercalary heterochromatin algorithmic advances in variant calling, enable studies of de (Flueck et al. 2009;see fig. 1). As expected, low-complexity novo variation in eukaryotes (Weisenfeld et al. 2014). regions also show a marked improvement in accessibility; of 10.8 Mb of sequence identified as low-complexity by Applying such an approach to clinical samples representing clonal expansions in Senegal, we show that de novo mutation DustMasker (Morgulis et al. 2006), 80% is accessible using can be reliably called in a eukaryotic parasite at a specificity either DISCOVAR or HC250 compared with 65% accessible that makes it amenable for determining phylogenetic rela- using HC100. tionships. We further demonstrate that P. falciparum can be considered a measurably evolving population (i.e., a popula- Marker Validation tion that can be sampled at different points in time generat- Following the in silico analysis of genome accessibility we ing a statistically significant number of genetic differences— assessed specificity and sensitivity of all three genotyping Drummond et al. 2003)and is thereforeasuitablesubject for approaches by comparing a sequenced DD2-2D4 sample to transmission network reconstruction. The wider use of these a validated set of genotype calls from the Pf-Crosses variant techniques would greatly increase the resolution of genomic set (Miles et al. 2016; Pf-Crosses data available at https://www. epidemiology studies in malaria and could have a dramatic malariagen.net/data/pf-crosses-1.0). Receiver-Operator impact on efforts to eliminate the disease. Characteristic (ROC) curves were generated for variants in regions that were accessible to all callers (the “core” genome) Results and variants found in the additional genome accessible to To increase our ability to detect rapidly arising mutations in that caller (the “extended” genome; supplementary fig. S1, the P. falciparum genome, a modified library preparation Supplementary Material online]. Specificity rates were similar method was employed in which large volumes of DNA across all callers, with DISCOVAR being the most specific, were prepared without a PCR step (minimizing the introduc- albeit at a significant cost to sensitivity (DISCOVAR precision tion of amplification errors); 250 bp Illumina reads were gen- 0.34/sensitivity 0.39; HC250 0.28/0.87; HC100 0.31/0.88); these erated on a size-selected 450 bp fragment (ensuring results were similar when we included the extended genome overlapping reads that could be joined into one long frag- (DISCOVAR 0.33/0.38; HC250 0.31/0.87; HC100 0.28/0.86) ment); these reads were genotyped with DISCOVAR, an al- suggesting that there is little drop in specificity in low- gorithm designed specifically for this data type (Weisenfeld complexity regions of the genome. It is notable that the et al. 2014), as well as GATK HaplotypeCaller (DePristo et al. ROC curves for HC250 and DISCOVAR show “false positive” 2011). The extent and accuracy of these techniques were variants even at the highest VQSLOD scores and in all regions assessed using laboratory isolates of well-defined parasite of the genome, which likely represent real variants that were 1679 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE a) b) c) 1% 9% 18% 1% 7% 1% 10 4% 6 0 59% DISCO HC250 HC100 DISCO HC100+HC250 HC250 DISCO+HC250 HC100 DISCO+HC100 INACCESSIBLE ACCESSIBLE INACC.(TELOMERE) 0e+00 1e+06 2e+06 3e+06 position (bp) FIG.1. Genome accessibility was assessed based on in silico comparisons of two P. falciparum genome assemblies. Variants were called from reads simulated from the Pf_Dd2 reference and aligned to the Pf3D7_v3 reference and assessed as correct or incorrect based on a 200 bp flanking region on either side. Blocks of 1 kb across the genome were defined as accessible only if they had a read coverage within two standard deviations of the median and if they had called all variants accurately. Comparisons were made between DISCOVAR and HaplotypeCaller with both 250 bp (HC250) and 100 bp reads (HC100). Subplots show (a) the genome-wide distribution of accessible blocks, (b) the proportions accessible to combinations of callers, and (c) the overall genome area accessible to each caller. The strong overlap of HC250 and DISCOVAR calls suggests that the majority of this improvement derives from the use of longer reads; accessibility increased from 64.0% with HC100 to 78.4% with DISCOVAR and 79.3% with HC250. 10.8 Mb of low complexity sequence was identified by DustMasker, of which 80.8% 81.4%, and 65.4% was found to be accessible by DISCOVAR, HC250, and HC100, respectively. not callable with the 100 bp reads that were used for the Pf- tree, we compared the proportion of variants that supported crosses study. each bipartition, contradicted it (requiring a homoplasy, re- version or recombination to explain the split), or were irrel- Higher Specificity Genotyping Produces Robust evant (e.g., a variant contained entirely within one clade). The Phylogenetic Inferences mean proportion of variants that support the phylogeny of We prioritized specificity for subsequent analyses given the the largest clade was 0.75. This result is consistent with a high motivating application for these data was distance-based specificity and lower sensitivity genotyping method and does transmission network reconstruction methods, which are not suggest any recombination has taken place in these sam- sensitive to false positive calls. Using the combination of ples. Despite the relatively short sampling times covered (only 250 bp reads and DISCOVAR variant calls, we genotyped 6 years in the largest clade) bootstrap support for the samples from a clinical isolate collection from Senegal, repre- DISCOVAR phylogeny is well above 50% for most within- senting three clades (24, 26, and 29—Daniels et al. 2015). clade relationships (mean bootstrap value 0.71); samples Samples within these clades were indistinguishable by previ- with low bootstrap support were separated by no more ous techniques and represented independent infections col- than a year. lected over multiple years. To maximize specificity, De Novo Variants Can Distinguish IBD Parasites DISCOVAR calls were produced independently from each of two lanes of sequencing data, allowing discordant geno- For subsequent analyses, we have worked under the assump- tion that the samples within clades have not undergone mei- types to be discarded and further reducing the absolute abun- otic outcrossing with other lineages from the same clade. The dance of false positive calls. A phylogeny was calculated using maximum parsimony based on the Manhattan distance of Thie `s region is characterized by low transmission and a pre- vious study of 974 samples from this region, covering the SNPs and INDELs combined (SNP/INDEL evolutionary rates same period of time, showed a low prevalence of complex are not known a priori; fig. 2 and supplementary fig. S2, Supplementary Material online). (multilineage) infections (Galinsky et al. 2015). Although we A Lento/bipartition plot (Lento et al. 1995)was generated cannot absolutely rule out recombination between IBD indi- viduals, complexinfectionsareaprerequisite forrecombina- for this phylogeny (supplementary fig. S3 and table ST2, Supplementary Material online); for each bipartition of the tion. Given the low population prevalence of parasites chromosome Mb Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE Th162.12 Th162.12 a) b) 0.91 Th230.12 Th230.12 0.47 Th196.12 Th196.12 Th132.11 Th132.11 0.36 Th074.13 Th074.13 Th106.09 Th106.09 0.84 Th117.11 Th117.11 0.7 Th134.11 Th134.11 Th106.11 Th106.11 Th086.07 Th086.07 Th211.13 Th211.13 0.25 Th246.13 Th246.13 0.43 Th092.13 Th092.13 0.77 Th166.12 Th166.12 Th245.13 Th245.13 Th061.13 Th061.13 0.99 Th095.13 Th095.13 100 (vars) Th068.12 Th068.12 FIG.2. Phylogenetic resolution of parasite samples using standing variation versus de novo variants. (a) Maximum parsimony tree of standing variation (24-SNP molecular barcode—Daniels et al. 2015). The last two numerals in sample names indicate the year of collection. Clades are resolved, but samples within clades are indistinguishable. (b) More than 3,000 de novo variants were called via DISCOVAR that segregated the individuals into three IBD clades. Samples within clades 24 (red), 26 (blue), and 29 (green) were separated by 622, 828, and 1858, variants, respectively, with the closest individuals (Th106.09/Th074.13) distinguished by 88 de novo variants. Phylogenetic trees were calculated using 15276 SNPs and 11420 INDELs using maximum parsimony. Numbers on nodes indicate bootstrap support. Our ability to discern phylogeny using only de novo variants was high with bootstrap values above 0.5 for all nodes subtending samples collected at >1 year apart. belonging to each clade, and the attendant low probability descended from a single common ancestor without recom- that asinglehostwould have been simultaneously infected bination, the clades were therefore considered identical by by two parasite lineages of the same clade, we conclude that descent (IBD). Within clades, pairwise distances ranged from outcrossing is unlikely to have occurred. Variants found within 88 to 677 variants, with both extremes being present in a clonal expansion are therefore likely to represent de novo clade 29. As might be expected, the oldest sample mutation. Of the 26,696 variants that distinguished the three (Th086.07, collected in 2007) exhibited a greater genetic clades from each other, 3,200 segregated within the clades distance with respect to the other samples in clade 29, (622, 828, and 1858 for clades 24, 26, and 29, respectively). though this pattern was not repeated in the other two Some loci segregated within more than one clade (9 loci in clades consisting of more closely contemporaneous sam- clades 24 and 26; 39 in 24 and 29; 57 in 26 and 29; 1 locus ples (2012–2013; fig. 2). segregated in all three clades) (supplementary fig. S2, Variants were found predominantly in intergenic sequence Supplementary Material online); though many of these loci for both SNPs (78.5%) and INDELs (83.7%); within coding showed different alleles in different clades, we are unable to sequence in-frame INDELs and missense SNPs generate a fur- distinguish between independent mutations versus shared ther 15% of the variants in each class (table 1) indicating that variation. As might be expected, variants in the extended ge- selection would not significantly confound phylogeny recon- nome showed a higher degree of miscalling, with each clonal struction. Perhaps surprisingly, whereas we would expect the lineage having a different degree of miscalling—indicating a majority of sequencing errors to be observed as singletons, haplotype specific effect (supplementary table ST1, the proportions of nonsense variants did not increase when Supplementary Material online). Genes that are frequently considering only singletons, indeed the proportion of inter- used for the study of parasite diversity were also examined; genic variants significantly increased when considering only no variants of any type were found within any of the clades for singletons. This could represent terminal-branch variants in MSP1, MSP2,SERA2, TRAP,or CSP consistent with the earlier the rapidly mutating intergenic regions or could imply that sequencing carried out on these samples (A. Bei—personal false positive variants deriving from alignment error (which communication). The uniformity of these markers within an would be less likely to be found as singletons) are more prev- otherwise panmictic population indicated that each clade alent in coding regions. 1681 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE Table 1. Mutation Consequences by Variant Class and Allele Frequency. Variant Consequence INDEL SNP Nonsingleton Singleton frameshift_variant 12 (1.0%) 0 (0.0%) 4 (0.2%) 8 (0.3%) inframe_deletion 87 (7.1%) 0 (0.0%) 27 (1.6%) 60 (2.4%) inframe_insertion 85 (6.9%) 0 (0.0%) 9 (0.5%) 76 (3.0%) intergenic 1025 (83.7%) 2331 (78.5%) 1415 (86.0%) 1941 (76.1%) intragenic_variant 0 (0.0%) 1 (0.0%) 0 (0.0%) 1 (0.0%) missense_variant 12 (1.0%) 445 (15.0%) 142 (8.6%) 315 (12.4%) non_coding_exon_variant 4 (0.3%) 10 (0.3%) 3 (0.2%) 11 (0.4%) stop_gained 0 (0.0%) 12 (0.4%) 3 (0.2%) 9 (0.4%) synonymous_variant 0 (0.0%) 171 (5.8%) 43 (2.6%) 128 (5.0%) Phylogeny Shows Distinct Substructure within a Single Clonal Expansion Th162.12 An “unbalanced” phylogeny can indicate the presence of Th196.12 super-spreader events or long infection chains (Colijn and Th132.11 Th074.13 Th230.12 Gardy 2014) both of which will generate more internal nodes Th106.09 that are connected to a single leaf descendent. Sackin and r = 0.653, p = 0.04 Colless indices were used to assess the phylogenies and both measures showed some elevation above expectations for a Th106.11 Th117.11 balanced tree, however, none of the clade phylogenies were Th134.11 found to be significantly unbalanced (Blum and Franc ¸ois Th086.07 2005), nor was the overall phylogeny of all three clades, and we cannot conclude that any of these phylogenies represents a super-spreader event or a single infection chain. Although all of the samples within a clade are descended from asinglecommonancestoroverthe time since thisre- cent common ancestor, we can expect some structure to have emerged within these clades. The parsimony tree (fig. 2b) for the largest clade 29 showed two distinct subclades, one consisting of three samples (subclade 29.1: Th117.11, 2000 2000 2005 2010 2015 Th134.11, Th106.11) and another larger clade with six samples Collection Date (subclade 29.2: Th162.12, Th230.12, Th196.12, Th132.11, FIG.3. Root-to-tip distance in the phylogeny of clade 29 correlates Th074.13, Th106.09), with further subdivisions in the second with sampling time. The observed correlation of genetic distance and clade. We conclude from this that the two clades represent time (via Pearson’s product moment) indicates that many of our distinct clonal expansions that have persisted separately variants are de novo and that mutation occurs at a sufficiently high within the population for at least 4 years, despite being trans- rate to resolve patterns of malaria transmission. Regression from mitted within the same geographical area (supplementary fig. sampling times indicates a common ancestor for the clade may S4, Supplementary Material online). have existed in approximately the year 2000. Clade 29 Is a Measurably Evolving Population other two clades, clade 29 was sampled over a far greater It is important to establish whether or not any of these clades period of time, consisting of samples from 2007 to 2013; represent a measurably evolving population (MEP): a popu- the MEP analysis showed an MRCA in the year 2000 which, lation in which a set of samples can be seen to have evolved although based on a small number of samples, would imply between different sampling times. In an MEP much or all of we have sampled the clade for a little over half of the time it the variation will have accrued de novo between the first and has been extant. Consequently this clade showed a much last sampling years rather than due to divergence from a stronger phylogenetic signal, with TreePuzzle analysis resolv- common ancestor before sampling began; as a result, the ing 71.4% of all quartets (supplementary fig. S5, root-to-tip distance of the resultant phylogeny should corre- Supplementary Material online). This clade is therefore suit- late with the sampling time, because samples collected later able for inference of evolutionary rates and can be used to have had more opportunity to diverge from the most recent reconstruct a transmission network. commonancestor(MRCA)(Drummond et al. 2003). The two smaller clades were not sampled over sufficient De Novo Mutations Can Reconstruct Transmission time for us to confirm that we had detected an MEP, how- ever, clade 29 showed significant correlation between root-to- Pathways Transmission networks were calculated using a minimum- tip distance and time (R¼ 0.65, P¼ 0.04) via TempEST anal- ysis (fig. 3) and was confirmed as being a MEP. Unlike the distance based method (Jombartetal. 2011) that relied on Root-tip Distance (variants) Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE Evolutionary Rates Indicate That De Novo Genomic Th074.13 Epidemiology Studies Are Viable in P. falciparum Evolutionary rates were calculated for all pairs within the [1.0] transmission network for clade 29 using the method of Li 0.03 et al. (1988). Highly variable evolutionary rates were seen be- Th162.12 [0.5] tween samples and a larger study would be necessary in order [1.0] 0.02 Th230.12 to derive areliableevolutionaryratefor P. falciparum. [1.0] Th086.07 Th106.09 121 [0.48] [1.0] 0.02 However, comparisons of the evolutionary rate within core 0.11 [ 0.02] Th132.11 and extended regions of the genome do show an increase in [0.41] 139 our ability to discern an MEP; the core genome acquires 0.84 0.06 Th196.12 [0.59] (61.8) mutations/month/genome while the extended regionsgaindenovovariantsatthree times thisrate2.08 [0.97] [0.03] 0.29 (61.3) muts/mth/gen (supplementary fig. S6a, [0.76] Th134.11 Supplementary Material online). The combined evolutionary 0.05 rate for all mutation classes is therefore 2.92 (62.3) muts/ [0.24] Th106.11 mth/gen (fig. 5 and supplementary fig. S6b, Supplementary [1.0] Material online). This figure is comparable to the high rate of [bootstrap] Th117.11 0.04 Distance reversion rate mutation acquisition in some RNA viruses (fig. 5 and supple- mentary fig. S7, Supplementary Material online) and indicates FIG.4. Transmission networks were estimated for clade 29 based on that direct transmissions typically acquire at least some de SNP and INDEL distances combined using a minimum distance tree novo mutations between host infections. approach. Edges are labeled with pairwise genetic distance, as well as bootstrap values (superscript in parentheses) and reversion rate (sub- Discussion script). Bootstrap values were calculated for all edges by sampling This study demonstrates for the first time the viability of using with replacement for 100 iterations and indicate strong support for de novo mutation to characterize transmission in a eukary- many of these distance-derived relationships. Potential alternative edges derived from the bootstrap results are shown in gray. Nodes otic pathogen and further shows how this technique could be in later years are shown in darker colors. Concordance between the used to study the transmission of malaria—a disease that, parsimony tree and distance based methods is notable; in both the despite a worldwide elimination campaign, still kills phylogeny and transmission network sample Th106.09 is basal to >400,000 people per year (WHO 2016). To access rapidly subclade 29.1 and Th106.11 to subclade 29.2. Reversion rates are sig- evolving low-complexity sequence, we have used long se- nificantly higher for the edge joining Th106.09 to Th106.11 support- quencing reads, PCR-free library preparation and a large k- ing the independence of subclade 29.2 mer during the assembly stage, opening up a further 14% of the P. falciparum genome to accurate genotyping. The in- Manhattan distances of both variant types combined creased coverage in turn enables more rapidly mutating var- (fig. 4). Bootstrap values were high for most relationships iants to be called than was previously possible, and the with a majority of secondary transmission paths weakly accompanying increased specificity prevents the de novo sig- supported. The distance-derived transmission path pro- nal from being lost amidst the noise of sequencing or align- vided additional support for the multiclade structure that ment error. Within a set of closely related samples from was seen in the maximum-parsimony tree results, indi- patients in Senegal (Daniels et al. 2015), we were able to cating a separate expansion of subclade 29.1 and 29.2. call 3,200 variants that successfully resolved ancestry among Both of these clades were found to be related to the basal these parasites and generated robust phylogenies for what sample Th086.07. appeared to be identical clones when assayed at lower genetic As a secondary confirmation of the network, we resolution. mapped individual variants to each vertex in the network Two of the clades we studied were sampled across only 2 and calculated the reversion rate across all primary edges years each and would not have the capacity to show the (i.e., the number of de novo mutations that would need correlation between sampling time and root-to-tip distance to be lost to support each step); we expect that variation that would suggest these mutations had arisen during the accrued on each lineage since divergence as well as gen- sampling period. On the other hand, the third clade persisted otyping errors will contribute to this statistic, however, for a total of 7 years and was determined to be a measurably these two types of “variants” would be indistinguishable. evolving population. This level of resolution is a first for a The majority of edges showed reversion rates of 11% or eukaryotic pathogen over time scales relevant to disease less, however, the edge connecting subclade 29.2 to the transmission (Biek et al. 2015) and supports the use of de rest of the samples (Th106.09 through Th106.11) showed novo variants to examine the phylodynamics of malaria in the a markedly higher reversion rate (0.29). This discrepancy future. indicates a lack of sampling depth in the early years of the To demonstrate the utility of this approach, a transmission expansion and shows that we have not captured a true network was inferred for the largest clade using a minimum- progenitor of subclade 29.2. distance approach (fig. 4; Jombartetal. 2011). For 1683 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE 1.4 wks 1.5 wks 1.6 wks 1.9 wks RNA virus Bacteria Eukaryote 2.4 wks core extended 3.9 wks 10 1.3 mths 3.3 yrs FIG.5. Lower mutation rates and the restricted genome size of the core genome would generate a single new mutation on average every month. The increase in accessibility offered by 250-bp reads and DISCOVAR increases both genome size and measurable evolutionary rate, resulting in a new variant every 2.4 weeks in the low complexity genome and every 1.5 weeks overall. This makes P. falciparum comparable to other infectious agents like influenza or HIV where transmission networks may be informed by genome sequencing. Mutation rates derived from Biek et al. (2015). characterizing malaria transmission, de novo variation repre- (Th132.11, Th196.12, and Th230.12). It is important to note sents a significant advance over standing variation, which that this conclusion would have been impossible to draw would be unable to identify directional relationships between from a combination of standing variation and epidemiolog- infections. ical data: although all six of these samples were found within Evolutionary ratesascalculatedacrossthe third-clade the Thie `s region, neither clade separates geographically and transmission network imply one new mutation is acquired individual members from separate clades were found in close every 1.5 weeks (fig. 5), meaning that at least one new muta- proximity (supplementary fig. S4, Supplementary Material tion should be found between direct transmissions. Although online). Perhaps more remarkably, despite 70% of the samples an- detecting a single mutation between individuals without any genotyping errors in a 23 Mb genome might be challenging alyzed here having been taken in the intervening years, sam- with current sequencing technologies, this does indicate that ple Th074.13 derives from Th106.09, and is entirely sufficient de novo mutations to reconstruct transmission independent of the other two clades. This means that at pathways can be expected in genomic epidemiology studies the beginning of the dry season in 2011 (when transmission would be expected to tail off) at least three separate clades of that were conducted over 2 years ormore—timesscalesthat are highly relevant to malaria transmission. this particular strain were circulating in the Thie `s region and Although the transmission network was generated as a all survived to subsequent wet seasons. Maintenance of these “proof-of-principle”, low rates of reversion indicate that, parasite types suggests a sizable asymptomatic reservoir of even if the number of intervening generations remains unre- infection may have been present. solved, many of the relationships identified are accurate and Intriguingly, the smallest of those clades consists of just one sample Th074.13; this sample came from a village 16 km out- some firm conclusions can be drawn from the network about the nature of transmission within Senegal. side of Thie `s and is separated from its closest node in the We can eliminate the possibility that these clonal parasites network by 6 years (supplemenatary fig. S4, Supplementary persist through a single chain of infections: an infection chain Material online), yet it is separated from this parasite by just would generate a tree with each node joined only to its de- 88 variants—the shortest genetic distance in our data set scendent (Colijn and Gardy 2014), yet both the tree and found over the longest intervening time and distance. As a transmission network show strong internal structure with result of the large population sizes that P. falciparum reaches well supported distinct clades; in particular, the clade of three during the erythrocytic life stages we do not expect de novo samples in 2011 (Th106.11, Th117.11, and Th134.11) is not mutations to be commonly fixed within the host, but instead directly related to the other clade found from 2011 to 2012 expect that differences in the consensus sequences of related EBOV HIV−M HFLUV H. pylori S. aureus M. tuberculosis P. falciparum evol. rate (muts / genome / year) Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE infections will only be fixed by passage through population In summary,wehaveshown aeukaryoticparasitetobe a bottlenecks during mosquito transmission. The lower evolu- measurably evolving population across time scales that are tionary rate found between these two samples could be a relevant to disease transmission, and we have calculated evo- result of long-term asymptomatic infections during which no lutionary rates that indicate de novo mutations could be new variants would expect to be fixed. Further investigations found even between direct infections. This type of genetic of the relationship between evolutionary rate and serial in- signal is qualitatively different to those that have been avail- terval would be of great interest and may ultimately allow us able in previous genomic epidemiology studies in malaria and to use genetic distance to infer the duration of asymptomatic will enable finer grained characterization of disease transmis- infections and their contribution to local disease incidence. sion. Though this data set is relatively small, we are limited by Although calculations of TMRCA are difficult from such a the requirement to use only clonal outbreaks for the network small sample set, the placement of the basal node in the year inference and challenges remain before these techniques can 2000, the diversity contained within the phylogeny, and the be used in larger data sets from higher-transmission settings, number of persisting clades all suggest that this clonal lineage where sexual recombination will add complexity to the re- has been long-lasting and did not emerge with sample construction. We hope that future data sets revealing de Th086.07 in 2007. novo mutations in larger collections of patient samples will Even with this limited data set we have successfully derived motivate the development of more sophisticated methods conclusions about the patterns of disease transmission in this for transmission-network reconstruction; methods that could region, but the capacity to easily derive transmission networks work on recombined samples or could tolerate a lower degree could be transformative for larger data sets, particularly in of genotyping specificity. Nevertheless, there are many regions preelimination settings. Reiner et al. (2015) have demon- in which the techniques employed here are of immediate use. strated the capacity for network reconstruction to enable As of the end of 2015, the WHO has certified 27 countries as assessments of “malariogenic potential” (a combined mea- malaria-free, three more countries are in the process of cer- sure of disease importation rates and capacity for local trans- tification (having had 3 years without indigenous transmis- mission) in Swaziland, a country nearing elimination. The sion), and a further ten countries have had at least 1 year network generated by Reiner et al. is derived from probabil- istic modeling of transmission based upon the times and without indigenous transmission (WHO 2016). In any of these geographic locations of the infections and would enable pub- settings there could be outbreaks that are partially or entirely lic health authorities to target interventions such as focal clonal. Reconstruction of a transmission network in the man- mass drug administration (FMDA) where they are most nerthatwehaveachievedhere would be able to detect any needed. It is worth noting that our network inference indi- links that were entirely within-country, thus distinguishing cates that some samples will confound expectations of spa- imported from indigenous transmission. These data could tiotemporal proximity, and validation with genetic data also identify “super-spreaders” that might be targeted with wouldbeavaluableadditiontosuchmapping attempts. focal interventions. The capacity to distinguish between in- In other pathogens network reconstruction using de novo digenous and imported transmission or to pinpoint hotspots variation has already led to actionable results for infectious- in countries nearing elimination could be key capacities for disease control and these situations could apply for malaria. malaria elimination efforts. There are now numerous cases in which whole-genome se- quencing and network reconstruction have been deployed in Materials and Methods real time to identify the source of a hospital-acquired infec- tion leading directly to the identification of reservoirs of per- Lab Strain Sample Preparation sistent disease transmission (Halachev et al. 2014; Quick et al. DISCOVAR libraries were prepared for two Dd2 lines (Dd2- 2015). Even on a wider geographic scale, where comprehen- 2D4 and Dd2-B2) and one 3D7 line (3D7-A-10). The Dd2-2D4 sive sampling of an outbreak cannot be guaranteed, the ap- line is derived from clone MRA-156 (originally deposited by proach demonstrated here can generate concrete Thomas E. Wellems and available from BEI resources, NIAID, recommendations for public health authorities. NIH https://www.beiresources.org/Catalog/ Transmission-network reconstructions of a tuberculosis out- BEIParasiticProtozoa/MRA-156.aspx) whereas clones Dd2-B2 break in Canada were able to show that later cases were and 3D7-A10 were derived from the MALDA/MDTIP consor- progressions from latent to active disease, rather than active tium (Cowell et al. 2018) and are available on request. transmission, with the direct result that the outbreak was Although the Dd2 lines derive from a common subcloned declared over, a conclusion that was reached solely by com- patient isolate, small numbers of de novo variants are parison of the rate of de novo mutation between samples expected to have been acquired via mutation and genetic (Hatherell et al. 2016). Adapting such a capacity to malaria drift in the years since the laboratory lines diverged. Three would have major implications for elimination efforts: malaria biological replicates (using separate isolates of DNA) were elimination within a country is defined as zero incidence of prepared for the Dd2-B2 and 3D7-A10 samples, and one li- indigenous cases (WHO 2016) and the ability to distinguish a brary of the Dd2-2D4. All libraries were sequenced in dupli- chain of transmission from repeated importations without cate sequencing lanes allowing us to remove individual recourse to a multicountry data set could be a key capacity for the WHO Global Malaria Program. miscalled bases where they were not consistent across lanes. 1685 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE Senegal Sample Preparation from the original 250 bp reads. This was followed by quality Samples used in this study derive from prior collections score recalibration using GATK VQSR (DePristo et al. 2011)— (Daniels et al. 2013, 2015)inThie `s, Senegal. All samples filtering by VQSLOD score with a 90% retention threshold. were collected from individuals after informed consent of The VQSR truth-set consisted of variant loci that had been either the subject or a parent/guardian. This protocol was previously determined in the “Pf-Crosses” data set generated reviewed and approved by the ethical committees of the by Miles et al. (2016).Thisdataset consists of aseriesof Senegal Ministry of Health (Senegal) and the Harvard T. H. genetic crosses of P. falciparum clones (3D7  HB3, HB3 Chan School of Public Health (16330, 2008) as detailed in Dd2, and 7G8  GB4) in which parents and all offspring were Daniels et al. (2015). Samples were collected passively from deeply sequenced, enabling variant calls to be filtered accord- patients reporting to the “Service de Lutte Anti-Parasitaire” ing to conformity with Mendelian inheritance expectations. (SLAP) clinic for suspected uncomplicated malaria between approximately August and December each year. Patients with Variant Validation acute fevers or history of fever within the past 24 h of visiting The P. falciparum genome presents a challenge to many ge- the clinic and with no reported history of antimalarial use notype callers due to extensive repeat structure and recom- were considered; they were diagnosed with malaria based on bination within some large gene families (Bopp et al. 2013), as microscopic examination of thick and thin blood smears and well as the high A/T nucleotide composition of the intergenic rapid diagnostic tests (RDT), as available. All samples were regions (Gardner et al. 2002). Moreover, validation of variant collected via venous blood draw and stored as glycerolyte calls in low-complexity regions of the P. falciparum genome stocks, and an aliquot of material collected on Whatman filter via conventional PCR and Sanger dideoxy sequencing is dif- paper was extracted for nucleic acid material and genotyped ficult, as PCR amplicons evaluated were generally too repet- via 24-SNP barcode as described in Daniels et al. (2013).From itivetoyield interpretablesequencetraces. As aresult, atwo- this larger data set, 18 samples were chosen representing stage concordance-based approach to genotyping was taken: single infections from clades 24, 26, and 29, (3, 5, and 10 1) the “accessible” genome was determined separately for samples, respectively) based on identical genotypes observed each genotype caller; 2) genotypesinthe regionsaccessible using the 24-SNP barcode. The 18 parasite samples were to both callers were compared with a known sample and culture-adapted to produce parasite DNA free of host (hu- assessed for specificity. man) contamination. Culture times were kept to a minimum Accessible Genome Determination. Miles et al. used the Pf- (<15 cycles) to preserve the integrity of the original patient Crosses data set in order to characterize the “accessible material, however, the acquisition of confounding de novo genome”—the region in which genotyping calls could be re- mutations within this culturing stage cannot be ruled out. liably made (Miles et al. 2016). The authors found 90% of the genome to be within an acceptable error rate, and only hy- Sequencing and Variant Discovery pervariable regions, centromeres and telomeres were ex- DNA extracted from all samples was size selected to give a cluded. DISCOVAR generates few variables upon which mean fragment length of 450 bp and sequenced using an variant recalibration could be performed; therefore, we took Illumina HiSeq 2500 with a read length of 250 bp, ensuring a stricter approach. Using in silico sequence generated from a an overlap of approximately 50 bp. Sequence was aligned to second genome assembly, we determined a region to be ac- the Pf3D7_v3 reference using BWA-mem v. 0.7.12 (Li 2014). cessible only if the variant caller can successfully reconstruct All alignments were deposited at the NCBI Short Read the region around all variants within it. Three variant calling Archive (SRA) database (accession: SRP137271). approaches were assessed: DISCOVAR using 250 bp reads, Variant discovery using DISCOVAR (release no. r52488) HaplotypeCaller (DePristo et al. 2011) with 250 bp reads (Weisenfeld et al. 2014) was performed with 250 bp reads (HC250) and HaplotypeCaller with 100 bp reads (HC100). on 450 bp fragments. The Pf3D7 genome was subdivided An artificial (in silico) Dd2-2D4 read set (250 bp fragment into 30 kb regionsand DISCOVAR wasrun separately foreach size, 450 bp insert size) was generated from a PacBio assembly region. In regions where it was not possible to construct a of the Dd2-2D4 line of P. falciparum (ftp://ftp.sanger.ac.uk/ single assembly graph for variant calling, DISCOVAR was pub/project/pathogens/Plasmodium/falciparum/DD2/) us- rerun with progressively smaller regions (10, 5, and 2 kb) of ing wgsim v0.3.2 (Li et al. 2009) using default parameters the genome. In all cases a 1 kb overlap was allowed to control and generating 100X coverage. This read set was aligned to for edge effects, though variants were only included up to the the standard Pf3D7_v3 reference assembly and genotypes inner edge of this flanking region. All libraries were sequenced were called using DISCOVAR and HaplotypeCaller as detailed and genotyped in duplicate to remove individual miscalled above. For each resulting variant, the putative Dd2-2D4 se- bases derived from sequencing error. Resulting genotypes quence in a 200 bp region around the variant was recon- were filtered based on Phred score (i.e., no call above 23 or structed and realigned to the Dd2-2D4 PacBio reference via any below ten included) and hypervariable sites (those for Smith–Waterman alignment (BWA-SW [Li and Durbin which the genotyping software generated >50 potential al- ternative variants) were removed. 2010]). Any differences between this locally reconstructed Variant discovery using GATK HaplotypeCaller was per- Dd2-2D4 sequence and the Dd2-2D4 PacBio assembly (i.e., formed with both 250 bp reads and a more typical data set the Levenshtein distance between the two strings) would (100 bp reads with a 300 bp insert size) that was generated therefore represent errors derived either from misalignment 1686 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE of the simulated reads or from the genotyping algorithm Temporal data were assessed by examining the correlation itself. This Smith–Waterman approach was used to compare between sampling time and the root-to-tip distance of a calls made using DISCOVAR using 250 bp reads, “non-clock” phylogenetic tree (in this case one derived by HaplotypeCaller using the same 250 bp reads, and via neighbor-joining) using TempEST v1.5 (Rambautetal. HaplotypeCaller using 100 bp reads. For each variant calling 2016). Correlation between genetic distance and time was method, the “accessible genome” was defined as those assessed via Pearson’s product moment. Transmission net- regions where alignment depth was within one standard de- works were reconstructed using SeqTrack within adegenet viation of the mean and where the Levenshtein distance (LD) (Jombartetal. 2011) based on the Manhattan distance of was zero (i.e., where reconstruction of Dd2-2D4 had been all variants: an approach that has been shown to outperform error free). Low complexity regions of the genome were parsimony based methods where sampling density is low assessed by running Dustmasker (version 1.0 in package (Worby et al. 2017). Bootstrapping of the transmission net- BLASTþ 2.2.30 using default parameters). work was performed by sampling all variants with replace- ment for 100 iterations; bootstrap values are reported as the proportion of bootstrapped networks supporting each join. Accessible Marker Validation An experimentally validated set of genotypes was then used Evolutionary Rate Calculation for confirmation of variants within those regions of the ge- For samples sharing very recent common ancestry, pairwise nome that were accessible to both HaplotypeCaller and variant call differences are expected to be dominated by se- DISCOVAR. This biological validation was performed using quencing error, leading to overestimation of the rate at which three laboratory lines of P. falciparum,eachsourced from de novo mutations accumulate. Where a transmission net- different laboratories and available on request. The called work could be resolved for a clade, evolutionary rates were variants were then compared with the set of previously val- therefore calculated for each IBD pair as the difference be- idated “Pf-Crosses” variants using VCFEval (Cleary et al. 2015): tween the distance of each sample to an outgroup (chosen receiver-operator curves (ROC curves—i.e., plots of true pos- from one of the other two clades) to account for ancestral itive and false positive rates at equal quality scores) and spe- diversity (Li et al. 1988). This was performed separately for the cificity/sensitivity values were calculated in the regions that regions of the genome determined to be callable with had been previously determined to be accessible by both DISCOVAR, HC250, and HC100 and the DISCOVAR-only ge- genotype callers and in the regions determined to be acces- nome yielding an estimate of the contribution to the evolu- sible only to DISCOVAR. Pf-Crosses data is available from: tionary rate of “core” and “extended” regions of the genome. https://www.malariagen.net/data/pf-crosses-1.0 The mean evolutionary rate is reported for each pair of sam- ples using all potential outgroups. Phylogeny and Transmission Networks SNP and INDEL distance matrices were calculated based on Supplementary Material the number of mutations between each sample pair, without Supplementary data are available at Molecular Biology and making adjustments based on putative mutation rate or STR length (all INDELs were treated as a single mutation regardless Evolution online. of length or whether they were found within a known STR). Acknowledgments Parsimony trees were generated using the R package We are grateful for the assistance of Thomas D. Otto (Sanger Phangorn (Schliep 2011) and neighbor-joining trees were gen- Institute) for providing the Dd2_v1 reference genome for erated using the R package “ape” (Paradis et al. 2004). Phylogenies were calculated by both methods for the pairwise validation. BEI resources, David Fidock (Columbia SNP distance, INDEL distance and the Manhattan distance of University), and Daniel Goldberg (Washington University in St. Louis) for supplying their Plasmodium stocks for sequenc- both SNPs and INDELs. Bootstrapping was performed by ing. We thank the UCAD/LeDantec laboratory and SLAP “sampling with replacement” for 100 replicates across each phylogeny. Bootstrap support is reported as the proportion of clinic team, particularly Dr. Ngayo Sy, and the patients for trees supporting each node. clinical samples. We thank Sin ead Chapman, Caroline Cusick, To assess the phylogenetic signal within the clades, the and Jim Bochicchio for project management, and Steve Schaffner and Angela Early for their insightful comments on Tree-Puzzle program (Schmidtetal. 2002) was used to ana- lyze all potential sequence quartets within the set. For any set the manuscript. Funding for this work at the Harvard T.H. of four individuals, a maximum of three phylogenies are pos- Chan School of Public Health and the Broad Institute was sible, and the TreePuzzle method randomly selects individuals provided by a grant to D.F. Wirth from the Global Health from the data set and assesses by maximum likelihood if it Program at the Bill and Melinda Gates Foundation accords with any of the three phylogenies (see Strimmer et al. (OPP1053604, “Genomic-Based Diagnostics for Elimination for details [Strimmer and von Haeseler 1996]). A quartet with and Eradication of Plasmodium”); publication was paid for clean phylogenetic signal will show high likelihood to one of via the Chronos system. Senegal sample collection was the tree topologies, a network with significant introgression funded by the NIAID (R01AI099105: “Genetic Variation and will show likelihood to more than one, and an unresolved Evolution of Artemisinin Resistance.”). This work was carried star-like phylogeny will show likelihood to none. out with support from the ExxonMobil Foundation. 1687 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Redmond et al. doi:10.1093/molbev/msy059 MBE elucidates Ebola virus origin and transmission during the 2014 out- References break. Science 345(6202): 1369–1372. Azarian T, Maraqa NF, Cook RL, Johnson JA, Bailey C, Wheeler S, Nolan Halachev MR, Chan JZ-M, Constantinidou CI, Cumley N, Bradley C, D, Rathore MH, Morris JG, Salemi M. 2016. Genomic epidemiology Smith-Banks M, Oppenheim B, Pallen MJ. 2014. Genomic epidemi- of methicillin-resistant Staphylococcus aureus in a neonatal intensive ology of a protracted hospital outbreak caused by multidrug- care unit. PLoS One 11:e0164397. resistant Acinetobacter baumanniiin Birmingham, England. Biek R, Pybus OG, Lloyd-Smith JO, Didelot X. 2015. Measurably evolving Genome Med.6(11):70. pathogens in the genomic era. Trends Ecol Evol. 30(6): 306–313. Harris SR, Feil EJ, Holden MTG, Quail MA, Nickerson EK, Chantratita N, Blum MGB, Franc ¸ois O. 2005. On statistical tests of phylogenetic tree Gardete S, Tavares A, Day N, Lindsay JA, et al. 2010. Evolution of imbalance: the Sackin and other indices revisited. Math Biosci. MRSA during hospital transmission and intercontinental spread. 195(2): 141–153. Science 327(5964): 469–474. Bopp SER, Manary MJ, Bright AT, Johnston GL, Dharia NV, Luna FL, Hatherell H-A, Didelot X, Pollock SL, Tang P, Crisan A, Johnston JC, Colijn McCormack S, Plouffe D, McNamara CW, Walker JR, et al. 2013. C, Gardy J. 2016. Declaring a tuberculosis outbreak over with Mitotic evolution of Plasmodium falciparum shows a stable core genomic epidemiology. Microb Genomics 2(5):e000060. genome but recombination in antigen families. PLoS Genet. 9(2): Jombart T, Eggo RM, Dodd PJ, Balloux F. 2011. Reconstructing disease e1003293. outbreaks from genetic data: a graph approach. Heredity 106(2): Chenet SM, Taylor JE, Blair S, Zuluaga L, Escalante AA. 2015. Longitudinal 383–390. analysis of Plasmodium falciparum genetic variation in Turbo, Lemieux J. 2015. Genomic analysis of evolution in Plasmodium falcipa- Colombia: implications for malaria control and elimination. Malar rum and Babesia microti. Doctoral Thesis: http://nrs.harvard.edu/ J. 14:363. urn-3:HUL.InstRepos:15821587. Cleary JG, Braithwaite R, Gaastra K, Hilbush BS, Inglis S, Irvine SA, Jackson Lento GM, Hickson RE, Chambers GK, Penny D. 1995. Use of spectral A, Littin R, Rathod M, Ware D, et al. 2015. Comparing variant call files analysis to test hypotheses on the origin of pinnipeds. MolBiolEvol. for performance benchmarking of next-generation sequencing var- 12(1): 28–52. iant calling pipelines. bioRxiv. doi: https://doi.org/10.1101/023754 Lenz C, Haerty W, Golding GB. 2014. Increased substitution rates sur- Colijn C, Gardy J. 2014. Phylogenetic tree shapes resolve disease trans- rounding low-complexity regions within primate proteins. Genome mission patterns. Evol Med Public Health 2014(1): 96–108. Biol Evol. 6(3): 655–665. Cowell AN, Istvan ES, Lukens AK, Gomez-Lorenzo MG, Vanaerschot M, Li H. 2014. Toward better understanding of artifacts in variant call- Sakata-Kato T, Flannery EL, Magistrado P, Owen E, Abraham M, et al. ing from high-coverage samples. Bioinformatics 30(20): 2018. Mapping the malaria parasite druggable genome by using 2843–2851. in vitro evolution and chemogenomics. Science 359(6372): 191–199. Li H, Durbin R. 2010. Fast and accurate long-read alignment with Daniels R, Chang H-H, Sene PD, Park DC, Neafsey DE, Schaffner SF, Burrows-Wheeler transform. Bioinformatics 26(5): 589–595. Hamilton EJ, Lukens AK, Van Tyne D, Mboup S, et al. 2013. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Genetic surveillance detects both clonal and epidemic transmission Abecasis G, Durbin R. 2009. The sequence alignment/map format of malaria following enhanced intervention in Senegal. PLoS One and SAMtools. Bioinformatics 25(16): 2078–2079. 8:e60780. Li W-H, Tanimura M, Sharp2 PM. 1988. Rates and Dates of Daniels RF, Schaffner SF, Wenger EA, Proctor JL, Chang H-H, Wong W, Divergence between AIDS Virus Nucleotide Sequences’. Mol Biol Baro N, Ndiaye D, Fall FB, Ndiop M, et al. 2015. Modeling malaria Evol. 5:313–330. genomics reveals transmission decline and rebound in Senegal. Proc Metsky HC, Matranga CB, Wohl S, Schaffner SF, Freije CA, Winnicki Natl Acad Sci U S A. 112(22): 7067–7072. SM, West K, Qu J, Baniecki ML, Gladden-Young A, et al. 2017. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Zika virus evolution and spread in the Americas. Nature Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A 546(7658): 411–415. framework for variation discovery and genotyping using next- Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, Gould K, generation DNA sequencing data. Nat Genet. 43(5): 491–498. Mead D, Drury E, O’Brien J, et al. 2016. Indels, structural variation, Drummond AJ, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG. 2003. and recombination drive genomic diversity in Plasmodium falcipa- Measurably evolving populations. Trends Ecol Evol. 18(9): 481–488. rum. Genome Res. 26(9): 1288–1299. Eyre DW, Cule ML, Wilson DJ, Griffiths D, Vaughan A, O’Connor L, Ip Morgulis A, Gertz EM, Sch€ affer AA, Agarwala R. 2006. A fast and sym- CLC, Golubchik T, Batty EM, Finney JM, et al. 2013. Diverse sources of metric DUST implementation to mask low-complexity DNA C. difficile infection identified on whole-genome sequencing. NEnglJ sequences. JComputBiol. 13(5): 1028–1040. Med. 369(13): 1195–1205. Narzisi G, Schatz MC. 2015. The challenge of small-scale repeats for indel FariaNR, QuickJ,Morales I, ThezeJ,deJesus JG,GiovanettiM,Kraemer discovery. Front Bioeng Biotechnol.3:8. MUG, Hill SC, Black A, da Costa AC, et al. 2017. Epidemic establish- Neher RA, Bedford T. 2015. nextflu: real-time tracking of seasonal ment and cryptic transmission of Zika virus in Brazil and the influenza virus evolution in humans. Bioinformatics 31(21): Americas. Nature 546(7658): 406–410. 3546–3548. Flueck C, Bartfai R, Volz J, Niederwieser I, Salcedo-Amaya AM, Alako BTF, Nkhoma SC, Nair S, Al-Saai S, Ashley E, McGready R, Phyo AP, Nosten F, Ehlgen F, Ralph SA, Cowman AF, Bozdech Z, et al. 2009. Plasmodium Anderson TJC. 2013. Population genetic correlates of declining trans- falciparum heterochromatin protein 1 marks genomic loci linked to mission in a human pathogen. Mol. Ecol. 22(2): 273–285. phenotypic variation of exported virulence factors. PLoS Pathog. 5(9): Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and e1000569. evolution in R language. Bioinformatics 20: 289–290. Galinsky K, Valim C, Salmier A, de Thoisy B, Musset L, Legrand E, Faust A, Park DJ, Dudas G, Wohl S, Goba A, Whitmer SLM, Andersen KG, Sealfon Baniecki ML, Ndiaye D, Daniels RF, et al. 2015. COIL: a methodology RS,Ladner JT, Kugelman JR,MatrangaCB, et al.2015. Ebola virus for evaluating malarial complexity of infection using likelihood from epidemiology, transmission, and evolution during seven months in single nucleotide polymorphism data. Malar J. 14:4. Sierra Leone. Cell 161(7): 1516–1526. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton Popovich KJ, Snitkin E, Green SJ, Aroutcheva A, Hayden MK, Hota B, JM, Pain A, Nelson KE, Bowman S, et al. 2002. Genome sequence of Weinstein RA. 2016. Genomic epidemiology of USA300 methicillin- the human malaria parasite Plasmodium falciparum. Nature resistant Staphylococcus aureus in an urban community. Clin Infect 419(6906): 498–511. Dis. 62(1): 37–44. Gire SK,Goba A, Andersen KG,Sealfon RSG, Park DJ,KannehL,JallohS, Quick J, Ashton P, Calus S, Chatt C, Gossain S, Hawker J, Nair S, Neal K, Momoh M, Fullah M, Dudas G, et al. 2014. Genomic surveillance NyeK,PetersT,etal. 2015. Rapid draft sequencing and real-time 1688 Downloaded from https://academic.oup.com/mbe/article/35/7/1678/4962453 by DeepDyve user on 20 July 2022 Disease Transmission Pathways in Clonal Malaria doi:10.1093/molbev/msy059 MBE nanopore sequencing in a hospital outbreak of Salmonella. Genome Strimmer K, von Haeseler a. 1996. Quartet puzzling: a quartet maximum- Biol. 16(1): 114. likelihood method for reconstructing tree topologies. Mol Biol Evol. Rambaut A, Lam TT, Max Carvalho L, Pybus OG. 2016. Exploring the 13(7): 964–969. temporal structure of heterochronous sequences using TempEst Weisenfeld NI, Yin S, Sharpe T, Lau B, Hegarty R, Holmes L, Sogoloff B, (formerly Path-O-Gen). Virus Evol. 2(1): vew007. Tabbaa D, Williams L, Russ C, et al. 2014. Comprehensive variation Reiner RC, Le Menach A, Kunene S, Ntshalintshali N, Hsiang MS, discovery in single human genomes. Nat Genet. 46(12): 1350–1355. Perkins TA, Greenhouse B, Tatem AJ, Cohen JM, Smith DL. 2015. WHO. 2016. WHO j World Malaria Report 2016. Mapping residual transmission for malaria elimination. Elife Worby CJ, Lipsitch M, Hanage WP. 2017. Shared genomic variants: iden- 4:e09520. tification of transmission routes using pathogen deep sequence data. Schliep KP. 2011. phangorn: phylogenetic analysis in R. Bioinformatics Am.J.Epidemiol. 186(10): 1209–1216. 27(4): 592–593. Zilversmit MM, Volkman SK, Depristo MA, Wirth DF, Awadalla P, Hartl Schmidt H. a, Strimmer K, Vingron M, von Haeseler A. 2002. TREE- DL. 2010. Low-complexity regions in Plasmodium falciparum: miss- PUZZLE: maximum likelihood phylogenetic analysis using quartets ing links in the evolution of an extreme genome. MolBiolEvol.27(9): and parallel computing. Bioinformatics 18(3): 502–504. 2198–2209.

Journal

Molecular Biology and EvolutionOxford University Press

Published: Jul 1, 2018

There are no references for this article.