The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic

Veronika Boskova; Tanja Stadler; Carsten Magnus

doi:10.1093/ve/vex044

The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic

Boskova, Veronika;Stadler, Tanja;Magnus, Carsten 2018-01-01 00:00:00 Each new virus introduced into the human population could potentially spread and cause a worldwide epidemic. Thus, early quantiﬁcation of epidemic spread is crucial. Real-time sequencing followed by Bayesian phylodynamic analysis has proven to be extremely informative in this respect. Bayesian phylodynamic analyses require a model to be chosen and prior distributions on model parameters to be speciﬁed. We study here how choices regarding the tree prior inﬂuence quantiﬁca- tion of epidemic spread in an emerging epidemic by focusing on estimates of the parameters clock rate, tree height, and re- productive number in the currently ongoing Zika virus epidemic in the Americas. While parameter estimates are quite ro- bust to reasonable variations in the model settings when studying the complete data set, it is impossible to obtain unequivocal estimates when reducing the data to local Zika epidemics in Brazil and Florida, USA. Beyond the empirical in- sights, this study highlights the conceptual differences between the so-called birth–death and coalescent tree priors: while sequence sampling times alone can strongly inform the tree height and reproductive number under a birth–death model, the coalescent tree height prior is typically only slightly inﬂuenced by this information. Such conceptual differences to- gether with non-trivial interactions of different priors complicate proper interpretation of empirical results. Overall, our ﬁndings indicate that phylodynamic analyses of early viral spread data must be carried out with care as data sets may not necessarily be informative enough yet to provide estimates robust to prior settings. It is necessary to do a robustness check of these data sets by scanning several models and prior distributions. Only if the posterior distributions are robust to rea- sonable changes of the prior distribution, the parameter estimates can be trusted. Such robustness tests will help making real-time phylodynamic analyses of spreading epidemic more reliable in the future. Key words: tree height; substitution rate; tree prior; molecular epidemiology; start of epidemic. et al. 2014; Pan American Health Organization and World Health 1. Introduction Organization Regional Office for the Americas 2015; Ventura et al. In February 2016, the WHO declared the current Zika virus (ZIKV) 2016; Zika Situation Report 2016). Although the ZIKV public health epidemic ongoing in the Americas to be a public health emer- emergency ended in November 2016, the ZIKV epidemic remains gency of international concern (World Health Organization 2016). ‘a highly significant and long-term problem’ according to the This emergency level was reached because of a possible link be- WHO (World Health Organization 2016). tween ZIKV infection and an increased number of microcephaly Bayesian phylodynamic approaches have proven very help- in newborns as well as between ZIKV infection and other neuro- ful in quantifying the spread of the ZIKV outbreak (Faria et al. logical disorders such as the Guillian-Barre´syndrome (Oehler 2016, 2017; Grubaugh et al. 2017; Metsky et al. 2017). Such V C The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/ licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 2 | Virus Evolution, 2018, Vol. 4, No. 1 analyses revealed that the current epidemic in the Americas to sample from the posterior distribution P½tree; parjseq; data. most likely goes back to one Asian lineage and was introduced Typically, we assume that seq is independent of data such that between May and December 2013 (Faria et al. 2016) first in P½seqjtree; par; data¼ P½seqjtree; par. The distribution P½seqjtree; par Brazil, from where it spread to other countries in the Americas is referred to as the phylogenetic likelihood and is specified by an (Metsky et al. 2017). The latter hypothesis is confirmed by the evolutionary model, and the distribution P[tree, datajpar] is referred observation that the epidemic in Florida, USA, seems to have to as the phylodynamic likelihood and is specified by a population started from several introductions, most of these introduced dynamic model. In the application of phylodynamics to epidemiol- lineages clustering with Carribean ZIKV sequences, inter- ogy, the sequencing data comes from pathogens obtained from dif- spersed within the bigger cluster of Brazilian sequences ferent hosts, and the population dynamic model is an (Grubaugh et al. 2017). epidemiological model. The term P½par is the prior distribution on In an emerging epidemic, sequence data might be limited the parameters of the population dynamic and evolutionary and may not have yet accumulated enough information in order models. to unequivocally quantify the phylodynamics, i.e. identify the If the sequence data contains a strong phylogenetic signal, underlying phylogenetic tree and the evolutionary and epidemi- the inferred tree topology will be independent of the prior speci- ological parameters. In fact, the following two aspects play a fications, i.e. the tree topology will be purely determined by dif- role in the outcome of such analysis. First, the tree prior may be ferences in the sequences. In addition to the tree topology, we very important when inferring the phylogeny using a Bayesian are typically interested in the tree height, i.e. the age of the framework. In particular, the impact of treating sampling times most recent common ancestor (MRCA) of all samples, and in all as given a priori vs. as part of the data has not been investigated other branching times of the tree. The clock rate is the parame- in a thorough analysis. Second, the interplay of the tree height ter of the evolutionary model that specifies how many substitu- and the clock rate prior in the case of limited empirical data tions a sequence accumulates per calendar time unit. It thereby may impact the tree inference in an unpredictable fashion. This translates the branches in the tree from units of substitutions to effect too, has not been explored in detail so far (but see Mo ¨ ller units of calendar time. If all sequences are sampled at the same (2017) for a thorough simulation study). time point, we can only estimate the product of the tree height In the present work, we explore the relevance of these two and the clock rate. If the data is sampled through time, i.e. the points for the estimation of the tree height based on data from tips of the tree are at different time points, we can in principle the ZIKV outbreak in the Americas, and compare the results ob- tease the two parameters apart (Korber et al. 2000). However, if tained with the Bayesian phylogenetics tool BEAST 2 to simpler there is not enough temporal signal in the data, it might not be regression-based and least squares-based techniques for tree possible to separate the clock rate from the tree height robustly. height inference. The tree height can be used to estimate the Little temporal signal will result in high sensitivity of the tree start of the epidemic. A robust estimate of when the epidemic height (and all tree branch lengths) to prior assumptions on the started is in particular necessary to investigate whether the in- clock rate and the tree height. crease of microcephaly cases in Brazil coincides with the While we can specify a clock rate prior directly through any ongoing ZIKV epidemic. Furthermore, we discuss the appropri- probability distribution, we can only indirectly specify the tree ateness of interpreting the reproductive number—a key epide- height prior through the population dynamic model. In phylo- miological quantity describing the intensity of epidemic dynamics, each population dynamic model is described by a so- spread—which is co-estimated with the tree height using the called tree prior. There are two classes of tree priors that have so-called birth–death model. proven especially useful: the coalescent (Kingman 1982; For the remainder of this section, we provide some back- Drummond et al. 2005), and the birth–death process (Nee 2006; ground regarding phylodynamics, introduce common tree priors, Stadler 2010). In the following paragraphs, we describe their and discuss their conceptual differences. These insights will be main assumptions and point out why the tree height cannot be important for interpreting the results presented in this paper. directly controlled through setting of the model priors. The coalescent describes the dynamics of the population giving rise to the tree backward in time, i.e. going from the tips 1.1 Phylodynamics and tree priors to the root. It is parameterized by the effective population size. The sampling times of sequences are assumed to be given a pri- The main idea of phylodynamic approaches is to quantify popu- ori. Given such a priori sampling times samp and an effective lation dynamic parameters based on the phylogenetic tree ob- population size N as a parameter par, the coalescent induces a tained from sequencing data (Grenfell et al. 2004). In practice, distribution of phylogenetic trees and a distribution of tree when using a Bayesian framework, the phylogenetic tree is not heights, fixed either, but co-inferred with the population dynamic as well as the evolutionary parameters based on the sequencing data (Drummond and Rambaut 2007; du Plessis and Stadler P ½treejpar ¼ N : samp e 2015). More formally, the Bayesian phylodynamic approach aims at inferring the posterior distribution P½tree; parjseq; data Given the sampling times, we therefore indirectly control where par is the vector of all parameters of the population dy- the tree height via the effective population size. namic and evolutionary models, seq is the sequencing data, and Birth–death models describe the dynamics of a population data denotes other potential sources of data. We can re-write, forward in time. They are parameterized by the birth (trans- mission) rate k, the death (become uninfectious) rate d,the P½tree;parjseq;data¼ P½seqjtree;par;dataP½tree;datajparP½par=P½seq;data: sampling probability p, and thetimeof the start of thepopula- tion (outbreak; also called origin time) T. Note that the repro- The normalizing constant P½seq; data cannot be calculated di- ductive number, a key epidemiological quantity (Anderson rectly, and thus Markov chain Monte Carlo methods are employed and May 1991), can be directly extracted from these Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 3 parameters by simply dividing the birth rate by the death rate. 2. Methods The birth–death models give rise to a distribution of phyloge- 2.1 Data sets netic trees and in particular a distribution of sampling times and tree heights, On 3 November 2016 we gathered 252 full genome ZIKV consen- sus sequences. These consensus sequences stem from various P½tree; data ¼ sampjpar ¼ðk; d; p; TÞ: sources: we obtained 185 sequences from GenBank, 33 sequences from the Zibra project (https://github.com/zibraproject/zibrapro Thus, the upper limit of the tree height can be controlled, as it ject.github.io/tree/master/data/consensus), 17 sequences from the is strictly smaller than the origin time T, while the precise tree Andersen et al. github page (https://github.com/andersen-lab/ height is influenced by the birth, death and sampling parameters. zika-florida/tree/master/consensus_sequences), and 17 sequences Notice that while the coalescent conditions on the sampling from Ladner et al. github page (https://github.com/jtladner/ZIKA_ times when the tree is induced by the population size parame- Florida/tree/master/sequences). From this set, we removed ter, in the birth–death model the sampling times along with the sequences isolated before 2014, sequences isolated from other or- tree are induced by the birth–death parameters. This different ganisms than humans, sequences with missing date of isolation, conceptual usage of the sampling times samp under the two i.e. year and/or month information missing, sequences resulting approaches has important implications towards assessing the from vertical (mother to child) transmissions and sequences amount of information in the data regarding the tree height which appeared to be duplicates, i.e. sequences that seemed to with popular Bayesian phylodynamics softwares. A Bayesian have been isolated from the same individual on the same date phylogenetic analysis sampling from the posterior distribution but had different passaging histories or were sequenced by differ- of trees tree and parameters par using the coalescent conditions ent labs. This resulted in a set of 139 sequences. We refer to this on the sampling times samp: data set as the ‘ALL’ data set, because it includes sequences from different parts of the Americas. We also separately analysed a Coal P ½tree; parjseq/ P½seqjtree; parP ½treejparP½par; samp subset (sixty-seven sequences) of this full data set which only samp contained the sequences coming from Brazil (abbreviated as Coal BRAZIL). Additionally, we separately analysed a set of sequences where P ½treejpar is the probability density of a coalescent samp from Florida, USA (twenty-three sequences, abbreviated USA). tree tree given the sampling times samp. On the other hand, the These sequences formed a monophyletic cluster within the maxi- birth–death model uses the number of samples together with mum clade credibility (MCC) trees of the ‘ALL’ data set upon re- the associated sampling times samp and the sequence align- moval of one sequence from the Dominican Republic (Fig. 1 and ment seq as data: Supplementary Figs S1 and S2, see figure legend for a description of the model settings with which we obtained the MCC tree). For BD P½tree; parjseq; samp/ P½seqjtree; parP ½tree; sampjparP½par; the sake of simplicity, we refer to this cluster as monophyletic in the following. BD where P ½tree; sampjpar is the probability density of the sam- pling times samp together with the birth–death tree tree. Common practice for investigating the signal in the data is to run the Bayesian method ‘under the prior’ and then compare the 2.2 Phylogenetic and phylodynamic analyses results to those obtained from the full Bayesian analysis. In the strict sense, running an analysis ‘under the prior’ would mean We aligned the sequences using MUSCLE v3.8.31 (Edgar 2004). running the analysis when all data is ignored. With the coalescent We constructed trees with a maximum likelihood (ML) method approach, this means that sequencing data is ignored, while the PhyML v3.1 (Guindon et al. 2010) under the HKY substitution number and the time of sequence samples is used, as it is not model, estimating the equilibrium frequencies, and starting the considered to be a part of the data but rather a priori information. ML tree search using ten different random seeds. We assessed In contrast, with the birth–death model, the sequences, the num- the clock-like behaviour of the data using TempEst v1.5 ber of sequences as well as the sequence sampling times are ig- (Rambaut et al. 2016), by looking at the correlation of the sam- nored when running the Bayesian method ‘under the prior’. This pling times with the root-to-tip divergence, optimising the posi- stems from the fact that the birth–death model parameters in- tion of the root of the ML tree by maximising the correlation duce a distribution of the number of sequences as well as the se- metric. From the ML trees, we also estimated the time of the quence sampling times, meaning any information we have in MRCA in calendar time units using LSD v0.3 (To et al. 2015), us- that respect during the analysis is data. ing options -c (constrained mode, i.e. imposing that the node is Many users perform phylodynamic analyses with the soft- always older than any of its descendants) and -r as (searching ware package BEAST 2 (Bouckaert et al. 2014). The graphical for root node on all branches, using constraints). Finally, we user interface BEAUti allows the specification of model priors analysed the sequences using the Bayesian phylogenetic soft- and input data. It also allows the user to specify if ‘sampling ware tool BEAST v2.4.2 (Bouckaert et al. 2014). For all analyses, from prior’ should be performed. If this option is chosen, the se- the tip dates were fixed to the dates found in the sequence an- quencing data, but not the sampling times, are ignored. For the notation. For those sequences with the day-of-sampling infor- coalescent models this means that one obtains the prior distri- mation missing, we simply set the date to be year/month/15 butions. However, for the birth–death models this means that (total of 6/139 sequences). The results are consistent with those one obtains a posterior distribution of trees and parameters us- obtained from the analyses where the samples with incomplete ing sampling times data, while only the sequencing data is ig- sampling date information were excluded. nored. The analysis with the birth–death model when opting for For the Bayesian analyses, we need to specify two modelling ‘sampling from prior’ in BEAST 2 is therefore not equivalent to components, namely the sequence evolution model and the the analysis ‘under the prior’ in the usual Bayesian sense. tree prior model, which we specified as follows. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 4 | Virus Evolution, 2018, Vol. 4, No. 1 Figure 1. MCC tree of the 139 ZIKV sequences included in this study. The posterior clade support is displayed at each branching point. The names of all virus sequences that were isolated in Florida, USA, are highlighted in green and all sequences from Brazil are highlighted in magenta. The cluster of sequences highlighted in green rep- resents the twenty-three strains of the USA data set that form one monophyletic cluster upon exclusion of one non-USA sequence. The MCC tree was obtained using the BDSKY model in BEAST 2 with d ¼ 18.25, three intervals for R and a relaxed clock (see Supplementary Figs S1 and S2 for other model parametrizations). 2.2.1 Sequence evolution model settings 2.2.2 Tree prior model settings For all analyses we chose the HKY substitution model as the se- For the tree prior, we used the birth–death skyline model quence evolution model with substitution rate fixed to 1, LNð1; (Stadler et al. 2013) and the coalescent Bayesian skyline plot 1:25Þ prior and 0 as the lower limit for the j parameter, and esti- (Drummond et al. 2005). The birth–death skyline model is a mated base frequencies and C-distributed variation between birth–death model with constant parameters within intervals sites, using four categories, the shape parameter prior being set to through time while parameters may change across intervals. Exp(1). We allowed for between-branch rate variation using a re- We used one interval for the become uninfectious rate (i.e. the laxed clock model (Drummond et al. 2006). We used the log-nor- ‘death’ rate) and fixed its value to 14.26, 18.25, or 23.40 (corre- mal relaxed clock model with ucldMean prior being LNð0:001; 2Þ in sponding to the inverse of the lower 95-percentile, mean and real space. Furthermore, the ucldStdev prior was set to Cð0:5396; upper 95-percentile of the estimates of the mean ZIKV genera- 0:3819Þ with 0 as the lower limit. This resulted in an effective prior tion time obtained by Ferguson et al. (2016)). The become unin- on the clock model’s rate.mean to have mean of 0.001 and me- fectious rate is the inverse of the time period of being dian of 0.0001. This prior was chosen to reflect the estimates of infectious. All the values of the become uninfectious rate here previous studies, where the substitution rate was estimated to be are in units per year. Furthermore, we used five equidistant in- between 0:98 1:06 10 subst/site/year (note that this unit cor- tervals between the first and last sample for the sampling prob- responds to ‘substitutions per site and year’ or subst site - ability. For each interval we set the prior distribution to –1 year )(Faria et al. 2016) but was suspected to vary between Beta(1, 700). The sampling probability before the first sample –4 –3 2.610 and 4.410 subst/site/year depending on the host or- was set to 0. For the effective reproductive number, R (which is ganism (virological.org 2016). We also performed analyses using the ‘birth’ rate divided by the ‘death’ rate), we used three, four, a strict clock with prior LNð9:2; 2:3Þ for the rate parameter. Note or six equidistant intervals between the root and the last sam- that in the Bayesian framework, the product of the substitution ple with LN(0, 1) prior distribution and 0 as the lower and 50 as rate from the substitution model and the clock rate from the clock the upper bound. For the origin parameter, the prior was LNð5; model is the overall substitution rate of the process. Since we 0:5Þ in real space and lower and upper bounds of 0 and 15, re- fixed the substitution rate in the substitution model to 1, and we spectively. The coalescent Bayesian skyline model is based on a only estimate the clock rate, the clock rate we report effectively coalescent with constant effective population sizes within in- refers to the overall substitution rate. tervals through time while the effective population size may Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 5 change across intervals. In the coalescent Bayesian skyline plot (Kampstra 2008). Lastly, MCC trees were obtained using we set the prior for the effective population size to a Markov- TreeAnnotator v2.3.0 (Rambaut and Drummond 2015), removing chained distribution, i.e. the prior value of the parameter in the initial 10% of samples as burn-in, setting posterior probability interval follows a Gamma distribution with mean being set to limit to 0.1 and reconstructing node height based on Common the value in the previous interval. We set the prior distribution Ancestor heights criterium. The MCC trees were visualised us- of the population size in the first interval to Unifð0; 10 Þ. We set ing FigTree v1.4.0 (Rambaut 2012). The BEAST 2 source files and the number of intervals for the effective population size and analysis scripts can be found in the Supplementary Materials group size to 3, 4, or 6. and Methods zip file. In summary, for the three ZIKV data sets, we performed analyses under relaxed and strict clock models, using the coa- 3. Results lescent and birth–death models with a range of settings as sum- 3.1 Robustness analysis of the tree height and the marized in Table 1. We ran the chain for 10 steps both with and clock rate without sequencing data (sampling from prior). We sampled pa- 4 6 rameters every 10 steps and trees every 10 steps. The log files First, we set out to estimate the start of the epidemic and the were inspected in Tracer (Rambaut et al. 2014). All parameters clock rate of the ZIKV epidemic for (1) the complete data set in all the runs mixed well (ESS >200) with one exception, containing 139 ZIKV sequences sampled in the Americas (ALL), namely strict clock, all data, birth–death (BD): 4R LN(0, 1), (2) 67 sequences sampled only in Brazil (BRAZIL), and (3) 23 se- without sequences, in Supplementary Fig. S3B (tree height and quences belonging to the locally contained outbreak in Florida, origin had ESS of only 159 and 187, respectively). USA (USA). All data sets contain sequentially sampled se- For the analyses investigating the evolution of R over time quences and therefore should in principle allow for calibration in Florida, USA, we used the sequences of the Florida monophy- of the molecular clock (i.e. estimating the clock rate) and reli- letic cluster. We fixed the ucldMean parameter to l ¼ 9 10 able estimation of the tree height. We used three methods for subst/site/year, which was the mean and the median estimate estimation of the clock rate and the tree height: a regression of the clock rate (rate.mean parameter in BEAST 2) we obtained model correlating sampling time to the root-to-tip divergence when analysing the ALL data set without Florida, USA se- (PhyML combined with TempEst), a least squares-based ap- quences (see Supplementary Fig. S4, w/o Florida). Furthermore, proach (PhyML combined with LSD) and Bayesian phylody- in these analyses we varied the number of intervals for R to be namic approaches (BEAST 2). For the latter, we used two 3, 4, or 6 and the number of intervals for sampling probability p different models, the birth–death skyline model (Stadler et al. to be 1 or 5. 2013) and the coalescent Bayesian skyline plot (Drummond Post-processing of Bayesian analyses was done by first dis- et al. 2005). carding the initial 10% of the MCMC samples as burn-in. Results The start of the epidemic is parameterized in the BD model were then analysed in R (R Core Team 2013) using custom-made 0 as T, and the tree height T is the age of the MRCA. As in the scripts with the R packages boa (Smith 2007) for calculating the Bayesian coalescent (Coal) framework and the two non- highest posterior density (HPD) intervals, and RColorBrewer 0 Bayesian frameworks, only the tree height T rather than T is es- (Neuwirth 2014) for plotting. To obtain the lineages-through 0 timated, we compare the estimated tree height, T , under all time (LTT) plot, the R package ape (Paradis, Claude, and methods, using it as an approximation for the start of the epi- Strimmer 2004) was used in combination with the function demic, T. In what follows, we will refer to the ‘time of the MRCA’ LTT.plot.gen in the R package TreeSim (Stadler 2011). To extract (tMRCA) as the ‘date when the epidemic started’. This quantity the node (tree) height from the tree necessary for plotting the is derived by subtracting the tree height from the date of the LTT along with the R plots, we used R package phytools (Revell most recent tip in the data set. The rationale for comparing 2012). We used the R package beanplot for plotting the probabil- non-Bayesian to Bayesian estimates of the tree height (or of the ity densities of time of the MRCA and clock rate estimates date when the epidemic started) and the clock rate is to assess Table 1. Overview of models used in this study. Phylodynamic model Parameter Value Evolutionary Clock model model Birth–death skyline model with Effective reproductive Estimated; 3, 4, 6 intervals (3 R ,4 HKYþC Relaxed clock serial sampling (BD) number (R ) R ,6 R ) with LN prior with various e e e parameterizations Become uninfectious rate (d) Fixed; Strict clock d 2f14:26; 18:25; 23:40g Sampling probability (P) Estimated; with prior Beta(1, 700) Coalescent Bayesian skyline Effective population size (N ) Estimated; HKYþC Relaxed clock plot (Coal) 3, 4, 6 intervals (3 N ,4 N ,6 N ); with prior on e e e ﬁrst interval Unifð0; 10 Þ, following Strict clock intervals: C with mean population size of previous interval The abbreviations used in Figs 3 and 4 and Supplementary Figs S3 and S4 refer to this scheme. For example, BD: 3R LN(0, 1), d ¼ 18:25 refers to the birth–death skyline model with serial sampling which allows for three intervals for R , each with LN(0, 1) distributed prior, and the become uninfectious rate d ¼ 18.25. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 6 | Virus Evolution, 2018, Vol. 4, No. 1 the amount of information available in the data without adding The samples in the ALL and BRAZIL data sets range over more prior information. than 1.37 years whereas the sequences in the USA data set were only collected during less than four months. The x-intercept of the regression line between the calendar time (x-axis) and the 3.1.1 Regression analysis of the tree height and the clock rate root-to-tip divergence (y-axis) can be used as an approximation To obtain the estimate of the phylogenetic tree, we applied the of the date when the epidemic started. This way, the start of the ML method PhyML v3.1 (Guindon et al. 2010) to the sequences. epidemic in the Americas was estimated to be 12 June 2013. The We assessed the clock signal in the data using TempEst v1.5 Brazilian epidemic was estimated to start 1.6 years earlier, i.e. (Rambaut et al. 2016)(Fig. 2). All three data sets show a positive 30 October 2011. The epidemic in Florida was estimated to have correlation between genetic divergence (root-to-tip divergence) –3 started on 31 May 2016, i.e. three years after the tMRCA of the and sampling time, yielding estimates of clock rate of 1.57 10 –3 ALL data set (see Fig. 2). These results indicate that the clock sig- subst/site/year for the ALL data set, 1.76 10 subst/site/year –3 nal may not be strong enough in (some of) the data sets, as the for the BRAZIL data set, and 6.42 10 subst/site/year for the ALL data set cannot have a younger tMRCA than any of its sub- USA data set. The ALL data set exhibits the strongest associa- data sets. In addition, the estimates for the ALL and USA data tion between the genetic divergence and the sampling time 2 sets are very close to the earlier obtained estimates of the intro- (R ¼ 0:32). The BRAZIL and USA data sets show more diffuse 2 duction of ZIKV into the Americas; however, the estimate for 2 2 patterns with lower R -values: BRAZIL: R ¼ 0:19, USA: R ¼ 0:26. Figure 2. TempEst estimates of tree heights and clock rates. The tree height and the clock rate (i.e. slope) estimates obtained using TempEst for the three data sets: (A) the complete data set (ALL), (B) the sequences isolated in Brazil (BRAZIL), and (C) the (monophyletic cluster of) sequences isolated in Florida (USA). The tree for each data set was reconstructed using the ML method. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 7 the BRAZIL data set seems to be much older than estimated be- branches (in BEAST 2, this is the rate.mean parameter). Ignoring fore (Faria et al. 2016, 2017; Grubaugh et al. 2017). the sequences, we obtain the chosen prior distribution for the clock rate under both the BD and the Coal framework as ex- pected (gray distributions in Fig. 4). 3.1.2 Least squares analysis of the tree height and the clock rate The ML tree was also analysed with the least squares dating 3.1.3.2 How much additional information do the sequences contain tool LSD v0.3 (To et al. 2015), in order to estimate the tree height and the clock rate in the ZIKV epidemic. The inferred tree height concerning the tree height and the clock rate? When performing phylogenetic analyses for the ALL data set including the se- for the ALL data set resulted in the estimated date when the epi- demic started to be 8 January 2012. For the Brazilian data set, quence data, we obtain relatively consistent estimates of the median tree heights and clock rates in all models and model the start of the epidemic was estimated to be 28 July 2010. The oldest tree and thus the most ancient start of the epidemic, dat- specifications (thick green lines in Fig. 3 and thick red lines in Fig. 4, respectively). The median estimates of the tree heights ing to 18 October 1980 was estimated for the USA data set. The clock rate estimates for the ALL, BRAZIL, and USA data sets translate to the estimated start of the epidemic ranging from 25 –4 –4 –5 August 2013 to 22 December 2013. The range of median esti- were 5.21 10 , 5.62 10 , and 3.46 10 subst/site/year, re- 3 3 spectively. Again, these results indicate that the clock signal is mates of the clock rate is 1:01 10 1:24 10 subst/site/ year. weak in (some of) our data sets. In the BD analyses with various numbers of intervals for the effective reproductive number R , all 95% HPD intervals of the 3.1.3 Bayesian phylodynamic analysis of the tree height and the tree height include the tree height estimate obtained by clock rate TempEst (12 June 2013)—Fig. 3. This finding is robust also when In the next step, we used phylodynamic analysis to overcome we assume a slower become uninfectious rate, but not for other the problem of point estimates imposed by PhyML combined variations of priors for BD model parameters that we tested, i.e. with TempEst and LSD. In a nutshell, we used two different when assuming a faster become uninfectious rate or when set- models, the birth–death skyline model (Stadler et al. 2013) and ting a strong prior on very large or an unrealistically low R in the coalescent Bayesian skyline plot (Drummond et al. 2005). e the BD model (Supplementary Fig. S3A, BD: 3R LN(2, 0.2), The term ‘skyline’ refers to the parameters (birth, death, sam- e d ¼ 18.25 and BD: 3R LN(–2, 0.2), d ¼ 18.25). In the Coal analy- pling, or effective population size) changing in a piecewise con- e ses with various numbers of intervals for the effective popula- stant fashion through time. As laid out in the introduction, the tion size N , all 95% HPD intervals of the tree height include the two models differ conceptually in how data is used, and as a e tree height estimate obtained by TempEst (12 June 2013). This common practice, the models are run without sequence data as finding is robust for different priors on the effective population well as with all sequence data to determine how much informa- size (Supplementary Fig. S3A). Only an extremely small coales- tion the sequence data contains. Therefore, we will first explain cent population size priors in the Coal model (Supplementary the estimation of the tree height and the clock rate without se- Fig. S3A, Coal: 3 N LN(–2, 0.2)) produce HPDs not including quences and then include the sequence data. e the TempEst estimates. Since in the BD model the effective reproductive number, Neither the BD nor the Coal 95% HPD intervals of the clock the become uninfectious rate, and the sampling probability are –3 rate estimates contain the TempEst estimate of 1.57 10 simultaneously unidentifiable (Stadler et al. 2013; Boskova, subst/site/year (Fig. 4), although they are in the same order of Bonhoeffer, and Stadler 2014), we fixed the become uninfectious magnitude (10 subst/site/year). This is consistent for all mod- rate in all analyses. Ferguson et al. (2016) estimated the inverse els (Supplementary Fig. S4A). The median clock rates estimated of this rate, i.e. the mean ZIKV generation time with its lower in a Bayesian framework are consistently lower than the and upper 95-percentile. For our main analyses, we used their TempEst estimate (Fig. 4 and Supplementary Fig. S4A). Only a mean estimate translating to the become uninfectious rate of prior for extremely small coalescent population size pushes the 18.25, and for the Supplementary analyses we used the upper clock rate estimate to a higher rate (Supplementary Fig. S4A, 95-percentile and the lower 95-percentile of the Ferguson et al. Coal: 3 N LN(–2, 0.2)). estimates, i.e. we set the become uninfectious rate to d ¼ 14:25 e None of the HPD intervals contain the estimates of the tree and 23.40, respectively. height and the clock rate provided by LSD for any of the prior set- tings. The LSD clock rate estimates are consistently lower than 3.1.3.1 What can we learn about the tree height and the clock rate the BD and the Coal estimates. The LSD tree height estimates are when ignoring the sequencing data? When analysing the data sets always above the BD/Coal tree height estimates (see Figs 3 and 4, using the BD model with information on sampling times and and Supplementary Figs S3 and S4). number of samples only, i.e. ignoring the sequence data, we ob- In contrast to the ALL data set, the analysis of the BRAZIL tain very peaked tree height distributions (gray distributions in and USA data sets including the sequence data leads to distribu- Fig. 3). Thus, the number of samples and sampling times con- tions of the tree height and the clock rate that varied strongly tain a lot of information regarding the tree height. The medians amongst the models (Figs 3 and 4 and Supplementary Figs S3A of these distributions depend on different model specifications. and S4A). In particular, the 95% HPD intervals of the tree height When analysing the data sets using the Coal model ignoring the distributions for the BRAZIL and the USA data sets are very sequence data, i.e. running the analysis under the prior, the tree broad (Fig. 3). For the Coal model, the posterior distribution re- height distributions are very wide as expected since the sam- mains very flat (i.e. does not change much from the prior). pling times are conditioned upon under the coalescent (gray These results indicate that the BRAZIL and USA data sets gath- distributions in Fig. 3). ered in November 2016 do not contain a strong enough signal to We performed all analyses shown in Fig. 3 using a relaxed ultimately estimate the tree height reliably using phylodynamic molecular clock that allows for variation of the clock rates be- analysis. tween the different branches of the phylogenetic tree When analysing the BRAZIL and USA data sets separately, (Drummond et al. 2006). The clock rate for the relaxed clock we obtain very different estimates for the clock rate from the BD model is defined as the mean clock rate averaged over all Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 8 | Virus Evolution, 2018, Vol. 4, No. 1 Figure 3. The effect of addition of sequence data on the tMRCA estimates in the Bayesian analysis. The probability distribution of the estimates of the tMRCA resulting from the analysis with (green) and without (gray) sequences is shown. The ﬁgure shows estimates obtained under various model assumptions (labels on the x-axis summarize the models as explained in Table 1) and the three different data sets (header). The median date of MRCA is indicated with a thick solid line and the 95% HPD intervals are marked with thin solid lines. The black dashed, dotted, and dashed-dotted lines represent the tMRCA estimates based on the TempEst analysis (Fig. 2) for the ALL, BRAZIL, and USA data sets, respectively. The gray dashed, dotted, and dashed-dotted lines represent the tMRCA of the LSD estimates for the ALL, BRAZIL, and USA data sets, respectively. The become uninfectious rate in the BD model is set to d ¼ 18.25, which is the mean estimate in Ferguson et al. (2016). Notice that the median estimates of the tMRCA for the BRAZIL and USA data sets analysed with the Coal model are beyond the limits of the ﬁgure, so we state the estimated tree heights below. For the BRAZIL data set, the median tree height estimate of distributions resulting from analyses without sequence data for Coal: 3 N C is 6 6 5 1.8 10 years, for Coal: 4 N C is 1.0 10 years and for Coal: 6 N C is 3.5 10 years. The median tree height estimate of the distribution when sequence data is e e included for Coal: 3 N C is 1026.7 years, for Coal: 4 N C is 960.1 years, and for Coal: 6 N C is 1039.8 years. For the USA data set, the median tree height estimate e e e 6 6 of distributions resulting from analyses without sequence data for Coal: 3 N C is 2.0 10 years, for Coal: 4 N C is 1.2 10 years, and for Coal: 6 N C is e e e 5.3 10 years. The median tree height estimate of the distribution when sequence data is included for Coal: 3 N C is 460.8 years, for Coal: 4 N C is 449.1 years, e e and for Coal: 6 N C is 443.5 years. models and the Coal models (Fig. 4). In the BD models, the se- agreement with the TempEst analyses. It remains to be investi- quence data is able to bring the clock rate estimates closer to gated though why LSD produce outliers. However, the BRAZIL the TempEst estimates although the posterior distributions re- and the USA data sets do not contain enough information to main quite diffuse. We do not see a consistent pattern of the perform reliable estimation of tree height and clock rate param- clock rate estimates getting closer to the LSD estimates how- eters of these sub-epidemics. ever. Adding sequence data to the Coal models leads to clock Through the inclusion of the sequence data in the analysis, rate estimates that are even lower than expected from the prior the clock rate and the tree height priors are allowed to interact. (Fig. 4) and thus further away from the TempEst and LSD If the data contains enough signal, the clock rate and the tree estimates. height can be teased apart. However, if the data has a weak sig- In our main analyses we used the relaxed clock model. nal for calibrating the clock, then both the clock rate and the Further, we repeated the analyses with a strict clock model tree height estimates may be biased, and the nature of this bias (Supplementary Figs S3B,C and S4B,C). Employing the strict would be influenced by the interaction of the two prior clock models, we estimated the clock rates and tree heights to distributions. be very similar to those inferred by the relaxed clock models. If there is very little signal in the data for separation of the From our analyses we conclude that the signal contained in clock rate and the tree height, only their product can be estimated the ALL data set is sufficient to relatively consistently estimate reliably. The median of the product of the clock rate and the tree the tree height and the clock rate using Bayesian methods, if height for the BD analyses with three intervals for R for the ALL the priors on the tree height are not too strongly biased towards data set is 0.00291 (95% HPD¼(0.00250, 0.00335)), for BRAZIL extremely early or late introductions. Therefore, one can date 0.00270 (95% HPD¼(0.00192, 0.00347)) and for USA 0.00127 (95% the introduction of ZIKV into the Americas around late 2013. HPD¼(0.00081, 0.00170)). For the Coal analyses with three intervals The posterior distributions of the time of introduction are in for N , we obtain following median estimates of this product: ALL Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 9 Figure 4. The effect of addition of sequence data on the clock rate estimates in the Bayesian analysis. The probability distribution of the clock rate (rate.mean parame- ter) estimates resulting from the analysis with (red) and without (gray) sequences is shown. The ﬁgure shows estimates obtained under various model assumptions (la- bels on the x-axis summarize the models as explained in Table 1) and the three different data sets (header). The median clock rate is indicated with a thick solid line and the 95% HPD interval marked with thin solid lines. The black dashed, dotted, and dashed-dotted lines represent the clock rate estimates based on the TempEst analysis (Fig. 2) for the ALL, BRAZIL, and USA data sets, respectively. The gray dashed, dotted, and dashed-dotted lines represent the clock rate of the LSD estimates for the ALL, BRAZIL, and USA data sets, respectively. The become uninfectious rate in the BD model is set to d ¼ 18.25, which is the mean estimate in Ferguson et al. (2016). The clock rate displayed is in units of s/s/y, i.e. subst/site/year. 0.00362 (95% HPD¼(0.00299, 0.00441)), BRAZIL 0.00239 (95% particular data set was isolated on 10 July 2016, the start of the HPD¼(0.00157, 0.00360)), and USA 0.00117 (95% HPD¼(0.00079, epidemic would then be (2016:52 2:7 years) October 2013. 0.00163)). For both BD and Coal models, the ALL data set has the However, our clock rate prior has a median of the rate.mean pa- –4 highest product, followed by BRAZIL and finally USA. This order- rameter set to 1.3 10 subst/site/year, which, when combined ing makes sense: the full data set has an underlying tree as old as, with the estimated product of the clock rate and the tree height, 0:00270 or older than, the tree obtained from any subset of the full data would lead to tree height estimates of ¼ 20:77 years, i.e. 0:00013 set. Assuming all the data sets share the same clock rate, the the start of the epidemic would be October 1995. Thus, if the product of tree height and clock rate should be largest for the ALL sequencing data does not contain enough information to cali- –3 data set. Also, the 95% HPD intervals of this product overlap for brate the clock to the expected 1 10 subst/site/year, the in- the Coal and the BD analyses in all three data sets. ferred tree height will increase, i.e. the tMRCA will be pulled up Having in mind that we can estimate this product from our from April 2014 and beyond October 2013, upon inclusion of the data, we can now understand better the inconsistencies ob- sequence data. This insight can explain our large tree height tained when aiming to separate the clock rate and the tree (median estimate of 4.7 years, i.e. the start of the epidemic height using sequencing data. We illustrate the reason for the being October 2011), and the slow posterior clock rate (median –4 inconsistencies using the BRAZIL data set. The BD: 3 R model estimate of the rate.mean parameter being 5.8 10 subst/site/ produces a peaked estimate of the tree height when the tree year which is much slower than estimates by other studies prior is combined with the sampling times only (gray distribu- (Faria et al. 2016)). The situation for the other prior settings for tion in Fig. 3). The median estimate of 2.2 years (start of the BRAZIL and USA data set analysed with the BD model is epidemic April 2014) is close to the previously reported esti- analogous to what we have just described for the BRAZIL BD: mates of the start of the ZIKV epidemic in the Americas 3 R model. (Faria et al. 2016). So why does adding sequence data shift the When the BRAZIL data set is analysed under the Coal: 3 N estimated date of when the epidemic started further into the model, the median estimate of the product of the clock rate and past? The median estimate of the product of the clock rate the tree height is 0.00239 subst/site, which is very similar to the and the tree height is 0.00270 subst/site. If the true clock rate one obtained with the BD model. Again, if we assume the true 3 3 was 1 10 subst/site/year as estimated by Faria et al. (2016), clock rate to be 1 10 subst/site/year, as estimated before the tree height would be 2.7 years. Given the last sample in this (Faria et al. 2016), the tree height would be 2.4 years and the Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 10 | Virus Evolution, 2018, Vol. 4, No. 1 start of the epidemic would be estimated to be in February The number of secondary cases caused by a single infected indi- 2014. Our clock rate prior has a median of the rate.mean param- vidual at time t is referred to as the effective reproductive number, –4 eter set to 1.4 10 subst/site/year, leading to estimated me- R , at time t. Similarly to R , the effective reproductive number e 0 dian tree height of 17.07 years, i.e. a start of the epidemic being provides important information on whether the epidemic is in June 1999. However, unlike in the BD model, the tree height able to spread further—this is the case if R > 1, or whether it is prior is very diffuse under the coalescent. The median prior tree prone to die out—i.e. if R < 1. height is 1:8 10 years. So if the data is unable to provide cali- The effective reproductive number during an ongoing epi- bration information for the clock rate, the inferred tree height demic is directly estimated under the birth–death skyline model would increase a lot and the tMRCA will be pulled up from as the birth rate divided by the death rate at time t. One of the February 2014 due to the tree prior. Indeed, when the sequence assumptions of the birth–death skyline model is that the sam- data is included in the analysis, the median tree height estimate pling intensity may vary through time but it is distributed uni- turns out to be 1026 years. The tree height gets pulled down formly at random across the considered infected population at from what would be dictated by the tree prior. This estimate is any point in time. This assumption is violated for the ALL and however still higher than what would be expected under the BRAZIL data sets, since they encompass a big geographic area, clock rate prior (or the clock rate estimates in other studies), re- with sampling and sequencing efforts being higher in some re- quiring the clock to be even slower than defined by the prior gions than in others (Faria et al. 2017; Metsky et al. 2017). In con- (median estimate of the rate.mean parameter after inclusion of trast, the sequences in the Florida, USA data set were sampled sequences being 2:4 10 subst/site/year). The situation for across a smaller area formed by three neighbouring counties other prior settings for BRAZIL and also when using the Coal (Grubaugh et al. 2017) with the zone of transmission being esti- model with the Florida, USA data set is similar. mated within these counties for most individuals (only for three These findings show that the data sets of BRAZIL and USA of twenty-three sequences included in our data set had an un- sequences do not contain enough information to calibrate the known zone of transmission). In addition, the infected individ- molecular clock and thus fail to infer a proper tree height. uals in this region seemed to be sampled uniformly at random at any point in time. As we cannot estimate the clock rate for this data set, we set 3.2 Estimates of the effective reproductive number for it to the median (identical to mean) clock rate estimate we ob- the USA data set tained from analysing the full data set without the twenty-nine Florida sequences (i.e. the sequences highlighted in green in The number of secondary cases induced by one index case, i.e. the number of individuals infected by one infected individual Fig. 1 and Supplementary Figs S1 and S2). Thus, we set the clock –4 rate to 9 10 subst/site/year. We show the results in Fig. 5 us- during his/her infectious period, in a totally susceptible popula- tion is referred to as the basic reproductive number R .It is a ing six intervals for the birth rate (and thus R ) and assuming a 0 e constant sampling probability throughout the period of sample very important parameter to be estimated at the start of an epi- demic. If R < 1 the epidemic is prone to die out, if R > 1 the collection. This analysis reveals that for up to the time of the 0 0 last sample, i.e. until October 2016, the ZIKV spread fastest in disease will spread amongst the population (Anderson and May 1991). During an ongoing epidemic, some individuals recover Florida in mid-2016 and that the R dropped below 1 around August 2016. Allowing fewer intervals for the birth rate, mean- from the disease and gain immunity such that the population is not completely susceptible anymore. Further, the individuals ing only estimating a coarser pattern for R , yields R estimates e e which are averages from the finer six interval analysis, as ex- might change their behaviour and thus the number of second- ary cases caused by a single infected individual might change. pected (Supplementary Fig. S5). Figure 5. Birth–death skyline plot based on the USA data set. We used the BDSKY model allowing six intervals for R and 1 mean value for the sampling probability. Opaque coloring for the effective reproductive number, R (orange), and the sampling probability (cyan), depict the results of analyses without sequence data, the darker shades display estimates after the addition of sequence data. The vertical dashed magenta lines and crosses indicate the sampling time points of the viral se- quences. The magenta curve summarizes the lineages-through-time plot averaged over all MCMC samples without the burn-in (ﬁrst 10% of samples). For all parame- ters we show the 95% conﬁdence intervals and the median estimates (solid central line). Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 11 We were interested to compare our R estimates to the prev- case when the sampling periods are too short or when sampling alence and incidence data reported by Pan American Health is biased, i.e. when sampling is confounded with the genetic Organization (2017) and the Florida Department of Health structure in the population (Murray et al. 2016). Repeating the (2017). However, these statistics on local transmissions only be- analyses by randomly permuting the sampling dates of the se- gin on 18 August 2016 and 30 July 2016, respectively. Both sour- quences, i.e. doing a Bayesian permutation test (Ramsden et al. ces report a drastic increase in new cases in the first half of 2008), can identify the latter but not the former bias (Duche ˆne October 2016. Since in the Florida cluster we analysed here, the et al. 2015). In contrast, a version of the Bayesian permutation sequences are sampled as early as in June 2016, we observe an test, the so-called clustered permutation test, which random- increase in R before the first reports of local transmission from ises the date assignment between sequences belonging to a PAHO/Florida Department of Health. Also, our last sample was monophyletic cluster, rather than doing the assignment com- obtained on 11 October 2016 and we therefore do not capture pletely at random, can deal with both situations when assessing the potential peak in R beyond (nor potentially slightly before) the amount of clock signal in the data (Duche ˆ ne et al. 2015). this date. However, Grubaugh et al. (2017) provide an estimate In our ZIKV data analyses, the tree height estimates from of the R in the Miami Dade county, one of the three counties TempEst are largely in agreement with the Bayesian estimates the sequence data included in our analyses comes from. Based for the ALL data set. Using LSD method, we obtain conflicting on the observed local transmission data and the total number tree height estimates compared to the Bayesian and TempEst of introduction events (travel-associated cases), they estimated estimates already for the ALL data set, which may be due to LSD the R to be 0.5–0.8. In our analyses, once we see the epidemic needing stronger clock signal in the data to produce reliable spread taking off in early March 2016, we estimate median R of estimates. 0.88, quickly rising above 1 as the epidemic progresses. Between Both TempEst and LSD provide only point estimates for the end of May and mid-July 2016 we consistently observe R > 1: clock rate and the tMRCA parameters. To explore the robustness median R between 2.1 and 2.2 and 95% HPD¼(1.1, 3.1). Prior to of these estimates, and to obtain the confidence intervals, indi- March 2016 we do not see the epidemic spread and estimate rect approaches such as a bootstrap procedure has to be per- median R between 0.6 and 0.88 (95% HPD: 0.05–1.8), which formed. Furthermore, these estimates are based on a single would be in line with Grubaugh et al. estimates. The discrep- input tree. In case of limited data, the tree reconstruction ancy between our R estimates after March 2016 and the R esti- method and settings, e.g. evolutionary model, chosen can e 0 mates of Grubaugh et al. could stem from the estimates of strongly influence the results of these analyses (compare Fig. 2 Grubaugh et al. being based on the infected patient count data where the input three is reconstructed using ML procedure, and reported by the public health agencies, which could have poten- Supplementary Fig. S6, where the tree has been reconstructed tially missed a lot of mild or asymptomatic ZIKV infections. using another popular method, namely neighbour-joining algo- rithm (Saitou and Nei 1987). 4. Discussion 4.2 Analyses using Bayesian phylodynamic method As soon as a new virus starts to spread within the human popu- Bayesian methods are an alternative to the point estimate lation, it is important to estimate the potential impact of this methods. In the Bayesian framework, the uncertainty in the epidemic on a global scale. To obtain trustworthy parameter es- tree topology can be naturally integrated out by sampling over timates, the sequences must contain enough signal regarding many plausible topologies. In addition, the result of a Bayesian the quantities of interest. Testing the sequence data for a signal inference is a distribution of parameter values, rather than a on the molecular clock rate is therefore necessary. single point estimate, providing a measure of uncertainty in the estimate. Furthermore, results can easily be tested for robust- 4.1 Analyses using point estimate methods ness across models and across priors for parameters to investi- Point estimate methods allow for dating the tree and obtaining gate the amount of signal contained in the data about any an estimate for the molecular clock rate. Here, we explored such particular parameter. estimates using TempEst (Rambaut et al. 2016) and LSD tree dat- Bayesian phylodynamic analysis requires model choices and ing (Guindon et al. 2010; To et al. 2015). Tools such as TempEst decisions on prior settings before analysing the sampled se- (Rambaut et al. 2016) that visualise the regression of the root-to- quence data. Making these choices is not trivial. Despite the fact tip divergence with the calendar time can be helpful in deter- that prior distribution for in total four parameters, i.e. origin, be- mining the amount of clock signal in the data. There are only come uninfectious rate, effective reproductive number and the few rules on how exactly these tools should be used. The au- sampling probability, needs to be specified in the birth–death thors of the TempEst say that ‘[...] the estimation of a negative model, this task may be easier than specifying the prior distri- evolutionary rate indicates that the data set contains little or no bution on the single parameter, i.e. effective population size, of temporal signal’ (Rambaut et al. 2016). They also point out that the coalescent model. Epidemiological surveillance data often the amount of signal in the data can be assessed visually or provide information on the duration of the infection in the pa- through the correlation coefficient, R , provided by the method. tient and also on the sampling probability. Furthermore, assum- However, guidelines that describe how to exactly perform this ing R around 1 is reasonable for most infections. However, visual inspection, or what the reasonable cutoff for R is to de- there is usually no information on the effective infected popula- termine if the information in the data is sufficient to allow for tion size, which is in fact a different quantity than the infected the separation of the clock rate and the tree height signal are population size. lacking. Assessment of the clock signal in the data using this A good practice in Bayesian phylodynamics is to first run the tool is therefore very subjective. analysis without the sequence data and only add the sequence Such regression methods as well as a molecular clock selec- data in a second run. Such analyses can be very revealing about tion approach have been found to lead to wrong estimates of both the amount of information in the sequences and the as- the amount of clock signal in the data. This is especially the sumptions of the underlying models. Analysing just the Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 12 | Virus Evolution, 2018, Vol. 4, No. 1 sampling times of ZIKV data sets with the coalescent tree prior height distributions interact once the sequences are added. This led to wide distributions for the tree height parameter. When phenomenon has been also recently observed in simulations the sequence data was included, the tree height distributions (Mo ¨ ller 2017). peaked for the case of the ALL data set but remained very broad Only when we are guaranteed that the data contains enough for the BRAZIL and the USA sequences only, indicating a lack of signal for the clock calibration, we can proceed with interpret- signal in the two latter data sets. Even though we assumed a ing phylodynamic parameters. In case of limited data, it is pos- flat prior on the origin, the birth–death model analyses yielded sible to carry out the analyses by fixing one of the parameters: peaked estimates of the tree height already when just the infor- the clock rate or the tree height. Fixing the former is straightfor- mation on the sampling times was included. This is due to the ward as the clock rate is a parameter for which a prior is chosen, fact that the sampling process is part of the birth–death model however, fixing the latter is more complex as the tree height outcome. Thus, the sampling times already provide information prior distribution is induced by the prior distributions on the about the epidemiological dynamics. In this sense, the birth– model parameters. One may simply fix the MRCA node height, death models are very similar to classic epidemiology, where i.e. use a calibration node. However, superimposing of such the sampling times without sequences are used for estimation MRCA node age prior on top of the population dynamic prior of epidemiological parameters. can lead to unexpected actual prior distributions (Heled and If the birth–death model is an appropriate model regarding Drummond 2015). In addition to the technical issue with fixing the dynamics of the (sampling and epidemiological) process, the clock rate or the tree height parameters in the analyses, sampling times provide useful information regarding the pa- availability of reliable estimates also influences which parame- rameters of the process. However, if the birth–death model does ter one chooses to fix. Clock rate estimates are often available not describe the process correctly, a peaked distribution ob- for many data sets. Fixing this value and assuming the sam- tained based on sampling times can be misleading, especially if pling process is unbiased across the population at any point in this distribution is shifted away from the true parameter value. time, we can infer epidemiological parameters using the birth– When sequence data is combined with such shifted distribu- death (skyline) model. It has been shown before, that if the sam- tions, and the information in the sequences is not strong pling process is mis-specified in the birth–death model, e.g. if it enough to push the distribution to the correct values, wrong is fixed to a high value, while the true value is low, then other conclusions about important epidemiological parameters may epidemiological parameters will be biased too (see effect of mis- be drawn. specification due to parameter correlations in Boskova, Sampling from an actual prior distribution under the birth– Bonhoeffer, and Stadler (2014)). If the sampling process is biased death model involves sampling the number of samples and the in such a way that certain parts of the transmission tree are sampling times along with all other parameters. One option to sampled more than others, and one fails to capture those varia- obtain such a prior distribution under the birth–death model tions in the model both the birth–death and the coalescent would be to simulate trees under a given set of epidemiological model may produce biases as both of these models assume that parameters using, e.g. MASTER (Vaughan and Drummond 2013) at each point in time, a random sample is taken from the full or TreeSim package in R (Stadler 2011). If such sampling was process. carried out, one would obtain prior distributions on all parame- We performed a phylodynamic analysis on the Florida, USA ters together with prior distribution of the sample number and data set, by fixing the clock rate, and find evidence for the in- sampling times. In such a case, if the prior distribution on the crease in spread of the ZIKV in Florida in mid-2016 followed by origin parameter was specified to be uniform, sampling from the decline in spread after August 2016. We used the birth– the actual prior will produce a uniform prior distribution for the death skyline model with up to 6 intervals for R to analyse the origin parameter. However, the prior distribution for the tree USA data. The reason for not increasing this number further height parameter would be defined by the exact combination of was that we did not want to over-parametrise the model. In the birth rate, the death rate and the sampling probability particular, the data set only consisted of 23 sequences and in- parameter. cluding more parameters to estimate in the model would only lead to wider posterior intervals. An ideal alternative to the simple skyline model would be to use a smoothing prior, inform- 4.3 Analysis of limited sequence data using Bayesian ing each next interval by the results gathered from the previous phylodynamics one. In the future, we hope such prior will become available for the birth–death skyline model, allowing for more detailed Sequence data carry information for two distinct parameters: analyses of small data sets such as the ZIKV Florida cluster ex- the tree topology and the clock rate. If the sequences are not di- plored here. vergent enough from one another, the topology will be hard to estimate. If the sequences are sampled close to one another in time, relative to the substitution rate, even if the topology is cor- 5. Conclusions rectly resolved, it will be hard to tease apart the tree height and the clock (substitution) rate. The former problem is dealt with Based on our results presented here, we suggest to perform using the Bayesian analysis, since the tree topology can be Bayesian phylodynamic analyses during ongoing epidemics treated as a nuisance parameter and is integrated out. When se- with highest care with respect to model robustness. Instead of quences do not contain enough information to calibrate the mo- using just one set of priors to perform an analysis, it is essential lecular clock, only the product of the clock rate and the tree to check the robustness of the estimates obtained with the cho- height can be estimated reliably. Here, we have shown that sen models and different prior settings. If the data has enough once we attempt to decompose this product into the clock rate phylogenetic and clock signal, the posterior distributions will be and tree height estimates using insufficient data, severe prob- consistent over the range of models chosen. Only then, the pos- lems arise. Sequences with little information on the clock rate terior distributions can reliably reflect important parameters may shift the estimates in a wrong direction (compared to the such as the effective reproductive number, the start or the size estimates without sequences) as the clock rate and the tree of the epidemic. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 13 Grubaugh, N. D. et al. (2017) ‘Genomic Epidemiology Reveals Acknowledgements Multiple Introductions of Zika Virus into the United States’, The authors express their gratitude to Louis du Plessis for Nature, 546: 401–05. helpful discussions and for making his skyline plotting Guindon, S. et al. (2010) ‘New Algorithms and Methods to scripts and TreeSlicer class from unreleased BEAST2 pack- Estimate Maximum-Likelihood Phylogenies: Assessing the age smoothingpriors (https://github.com/laduplessis/smooth Performance of Phyml 3.0’, Systematic Biology, 59: 307–21. ingpriors) available. V.B., T.S., and C.M. thank ETH Zu ¨ rich for Heled, J., and Drummond, A. J. (2015) ‘Calibrated Birth–Death funding. T.S. and V.B. were supported by the European Phylogenetic Time-Tree Priors for Bayesian Inference’, Research Council under the Seventh Framework Systematic Biology, 64: 369–83. Programme of the European Commission (PhyPD: grant Kampstra, P. (2008) ‘Beanplot: A Boxplot Alternative for Visual Comparison of Distributions’, Journal of Statistical Software, 28: agreement number 335529). 1–9. Kingman, J. F. (1982) ‘On the Genealogy of Large Populations’, Journal of Applied Probability, 19: 27–43. Authors’ contributions Korber, B. et al. (2000) ‘Timing the Ancestor of the HIV-1 Pandemic Strains’, Science, 288: 1789–96. V.B., T.S., and C.M. designed the study and the analyses. V.B. Metsky, H. C. et al. (2017) ‘Zika Virus Evolution and Spread in the performed the analyses. V.B., T.S., and C.M. wrote the paper. Americas’, Nature, 546: 411–15. Mo ¨ ller, S. (2016) ‘The Impact of the Tree Prior and Purifying Selection on Estimating Clock Rates during Viral Epidemics’, ETH Zurich. Supplementary data Murray, G. G. et al. (2016) ‘The Effect of Genetic Structure on Supplementary data are available at Virus Evolution online. Molecular Dating and Tests for Temporal Signal’, Methods in Ecology and Evolution, 7: 80–9. Nee, S. (2006) ‘Birth-Death Models in Macroevolution’, Annual References Review of Ecology, Evolution, and Systematics, 37: 1–17. Anderson, R. M., and May, R. M. (1991) Infectious Diseases of Neuwirth, E. (2014) Rcolorbrewer: Colorbrewer Palettes. R Humans: Dynamics and Control. Oxford: Oxford University Press. package version 1.1-2. Boskova, V., Bonhoeffer, S., and Stadler, T. (2014) ‘Inference of Oehler, E. et al. (2014) ‘Zika Virus Infection Complicated by Epidemiological Dynamics Based on Simulated Phylogenies Guillain-Barre Syndrome—Case Report, French Polynesia, December 2013’, Eurosurveillance, 19: pii: 20720. Using Birth-Death and Coalescent Models’, PLoS Computational Biology, 10: e1003913. Pan American Health Organization (PAHO) (2017) <http://www. Bouckaert, R. et al. (2014) ‘BEAST 2: A Software Platform for paho.org> accessed 9 June 2017. Bayesian Evolutionary Analysis’, PLoS Computational Biology, 10: Pan American Health Organization, and World Health Organization Regional Ofﬁce for the Americas (2015) e1003537. Drummond, A. J. et al. (2005) ‘Bayesian Coalescent Inference of ‘Epidemiological Alert: Neurological Syndrome, Congenital past Population Dynamics from Molecular Sequences’, Malformations, and Zika Virus Infection’. Implications for Public Molecular Biology and Evolution, 22: 1185–92. Health in the Americas. Technical report. <http://www.paho. org/hq/index.php?option¼com_docman&task¼doc_download et al. (2006) ‘Relaxed Phylogenetics and Dating with Conﬁdence’, PLoS Biology, 4: e88. &Itemid¼&gid¼32405&lang¼en>. Paradis, E., Claude, J., and Strimmer, K. (2004) ‘APE: Analyses of , and Rambaut, A. (2007) ‘BEAST: Bayesian Evolutionary Analysis by Sampling Trees’, BMC Evolutionary Biology, 7: 214. Phylogenetics and Evolution in R Language’, Bioinformatics du Plessis, L., and Stadler, T. (2015) ‘Getting to the Root of (Oxford, England), 20: 289–90. Epidemic Spread with Phylodynamic Analysis of Genomic R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Data’, Trends in Microbiology, 23: 383–6. Duche ˆ ne, S. et al. (2015) ‘The Performance of the Austria. <http://www.R-project.org/>. Date-Randomization Test in Phylogenetic Analyses of Rambaut, A. (2012) Figtree v1.4. Molecular evolution, phyloge- Time-Structured Virus Data’, Molecular Biology and Evolution, 32: netics and epidemiology. Edinburgh, UK: University of Edinburgh, Institute of Evolutionary Biology. <http://tree.bio. 1895–906. Edgar, R. C. (2004) ‘MUSCLE: Multiple Sequence Alignment with ed.ac.uk/software/ﬁgtree/>. High Accuracy and High Throughput’, Nucleic Acids Research, et al. (2014) Tracer v1.6, Molecular evolution, phylogenetics 32: 1792–7. and epidemiology, Edinburgh. UK: University of Edinbugh, Faria, N. R. et al. (2016) ‘Zika Virus in the Americas: Early Institute of Evolutionary Biology. <http://beast.bio.ed.ac.uk/ Epidemiological and Genetic Findings’, Science, 352: 345–9. tracer>. et al. (2016) ‘Exploring the Temporal Structure of et al. (2017) ‘Establishment and Cryptic Transmission of Zika Virus in Brazil and the Americas’, Nature, 546: 406–10. Heterochronous Sequences Using Tempest (Formerly Ferguson, N. M. et al. (2016) ‘Countering the Zika Epidemic in Path-o-Gen)’, Virus Evolution, 2: vew007. Latin America’, Science, 353: 353–4. , and Drummond, A. (2015) Treeannotator v2.3.0. Edinburgh, UK: University of Edinburgh, Institute of Florida Department of Health (2017) <http://www.ﬂoridahealth. gov/diseases-and-conditions/zika-virus/index.html> ac- Evolutionary Biology and Auckland, NZ: University of cessed 26 July 2017. Auckland, Department of Computer Science. Grenfell, B. T. et al. (2004) ‘Unifying the Epidemiological and Ramsden, C. et al. (2008) ‘High Rates of Molecular Evolution in Evolutionary Dynamics of Pathogens’, Science, 303: 327–32. Hantaviruses’, Molecular Biology and Evolution, 25: 1488–92. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 14 | Virus Evolution, 2018, Vol. 4, No. 1 To, T. H., et al. (2015) ‘Fast Dating Using Least-Squares Criteria Revell, L. J. (2012) ‘Phytools: An R Package for Phylogenetic Comparative Biology (and Other Things)’, Methods in Ecology and Algorithms’, Systematic Biology, 65: 82–97. and Evolution, 3: 217–23. Vaughan, T. G., and Drummond, A. J. (2013) ‘A Stochastic Saitou, N., and Nei, M. (1987) ‘The Neighbor-Joining Method: A Simulator of Birth–Death Master Equations with New Method for Reconstructing Phylogenetic Trees’, Molecular Application to Phylodynamics’, Molecular Biology and Evolution, Biology and Evolution, 4: 406–25. 30: 1480–93. Smith, B. J. (2007) ‘boa: An R Package for MCMC Output Ventura, C. V. et al. (2016) ‘Zika Virus in Brazil and Macular Convergence Assessment and Posterior Inference’, Journal of Atrophy in a Child with Microcephaly’, The Lancet, 387: 228. Statistical Software, 21: 1–37. virological.org (2016). <http://virological.org> accessed 30 Nov Stadler, T. (2010) ‘Sampling-through-Time in Birth-Death Trees’, 2016. Journal of Theoretical Biology, 267: 396–404. World Health Organization (WHO) (2016) Geneva <http://www. (2011) ‘Simulating Trees with a Fixed Number of Extant who.int/emergencies/zika-virus/quarterly-update-october/en/ Species’, Systematic Biology, 60: 676–84. > accessed 1 Dec 2016. et al. (2013) ‘Birth–Death Skyline Plot Reveals Temporal —— (2016) ‘Zika Situation Report – Neurological syndrome and Changes of Epidemic Spread in HIV and Hepatitis C Virus congenital anomalies. Technical report’ <http://apps.who. (HCV)’, Proceedings of the National Academy of Sciences of the int/iris/bitstream/10665/204348/1/zikasitrep_5Feb2016_eng.pdf? United States of America, 110: 228–33. ua¼1>. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Virus Evolution Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/the-influence-of-phylodynamic-model-specifications-on-parameter-aAqW6brw76

Loading next page...

References (42)

H. Metsky, C. Matranga, S. Wohl, S. Schaffner, C. Freije, S. Winnicki, K. West, J. Qu, M. Baniecki, A. Gladden-Young, Aaron Lin, C. Tomkins-Tinch, Simon Ye, Daniel Park, C. Luo, K. Barnes, R. Shah, B. Chak, G. Barbosa-Lima, E. Delatorre, Y. Vieira, Lauren Paul, A. Tan, Carolyn Barcellona, M. Porcelli, C. Vasquez, A. Cannons, Marshall Cone, K. Hogan, Edgar Kopp, J. Anzinger, Kimberly García, L. Parham, Rosa Ramirez, M. Montoya, D. Rojas, Catherine Brown, S. Hennigan, B. Sabina, S. Scotland, K. Gangavarapu, N. Grubaugh, G. Oliveira, R. Robles-Sikisaka, A. Rambaut, L. Gehrke, S. Smole, M. Halloran, L. Villar, S. Mattar, I. Lorenzana, J. Cerbino-Neto, C. Valim, W. Degrave, P. Bozza, A. Gnirke, K. Andersen, S. Isern, S. Michael, F. Bozza, T. Souza, I. Bosch, N. Yozwiak, B. MacInnis, Pardis Sabeti (2017)
Genome sequencing reveals Zika virus diversity and spread in the Americas
bioRxiv
R. Bouckaert, Joseph Heled, D. Kühnert, T. Vaughan, Chieh-Hsi Wu, Dong Xie, M. Suchard, A. Rambaut, A. Drummond (2014)
BEAST 2: A Software Platform for Bayesian Evolutionary Analysis
PLoS Computational Biology, 10
Robert Edgar (2004)
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic acids research, 32 5
P. Kaye (1993)
Infectious diseases of humans: Dynamics and control
Immunology Today, 14
P. Kampstra (2008)
Beanplot: A Boxplot Alternative for Visual Comparison of Distributions
Journal of Statistical Software, 28
A. Drummond, A. Rambaut, B. Shapiro, O. Pybus (2005)
Bayesian coalescent inference of past population dynamics from molecular sequences.
Molecular biology and evolution, 22 5
N. Saitou, M. Nei (1987)
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
Molecular biology and evolution, 4 4
A. Rambaut, Tommy Lam, Luiz Carvalho, O. Pybus (2016)
Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen)
Virus Evolution, 2
T. Vaughan, A. Drummond (2013)
A Stochastic Simulator of Birth–Death Master Equations with Application to Phylodynamics
Molecular Biology and Evolution, 30
A. Drummond, S. Ho, M. Phillips, A. Rambaut (2006)
Relaxed Phylogenetics and Dating with Confidence
PLoS Biology, 4
Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, M. Anisimova, W. Hordijk, O. Gascuel (2010)
New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0.
Systematic biology, 59 3
T. Stadler, D. Kühnert, S. Bonhoeffer, A. Drummond (2012)
Birth–death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV)
Proceedings of the National Academy of Sciences, 110
Gemma Murray, F. Wang, E. Harrison, G. Paterson, A. Mather, S. Harris, M. Holmes, A. Rambaut, J. Welch (2015)
The effect of genetic structure on molecular dating and tests for temporal signal
Methods in Ecology and Evolution, 7
Zika Situation Report -Neurological syndrome and congenital anomalies
B. Korber, M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. Hahn, Steven Wolinsky, Tanmoy Bhattacharya (2000)
Timing the ancestor of the HIV-1 pandemic strains.
Science, 288 5472
C. Ramsden, F. Melo, L. Figueiredo, E. Holmes, P. Zanotto (2008)
High rates of molecular evolution in hantaviruses.
Molecular biology and evolution, 25 7
Brian Smith (2007)
boa: An R Package for MCMC Output Convergence Assessment and Posterior Inference
Journal of Statistical Software, 21
S. Nee (2006)
Birth-Death Models in Macroevolution
Annual Review of Ecology, Evolution, and Systematics, 37
Simon Möller (2016)
The Impact of the Tree Prior and Purifying Selection on Estimating Clock Rates During Viral Epidemics
N. Grubaugh, J. Ladner, M. Kraemer, G. Dudas, A. Tan, K. Gangavarapu, M. Wiley, Stephen White, J. Thézé, D. Magnani, Karla Prieto, Daniel Reyes, Andrea Bingham, Lauren Paul, R. Robles-Sikisaka, G. Oliveira, D. Pronty, Carolyn Barcellona, H. Metsky, M. Baniecki, K. Barnes, B. Chak, C. Freije, A. Gladden-Young, A. Gnirke, C. Luo, B. MacInnis, C. Matranga, Daniel Park, J. Qu, S. Schaffner, C. Tomkins-Tinch, K. West, S. Winnicki, S. Wohl, N. Yozwiak, J. Quick, J. Fauver, K. Khan, S. Brent, R. Reiner, Paola Lichtenberger, M. Ricciardi, Varian Bailey, D. Watkins, Marshall Cone, Edgar Kopp, K. Hogan, A. Cannons, Reynald Jean, A. Monaghan, R. Garry, N. Loman, Nuno Faria, M. Porcelli, C. Vasquez, Elyse Nagle, D. Cummings, D. Stanek, A. Rambaut, M. Sanchez-Lockhart, Pardis Sabeti, Leah Gillis, S. Michael, T. Bedford, O. Pybus, S. Isern, G. Palacios, K. Andersen (2017)
Genomic epidemiology reveals multiple introductions of Zika virus into the United States
Nature, 546
(2014)
Rcolorbrewer: Colorbrewer Palettes. R package version 1.1-2
N. Ferguson, Z. Cucunubá, I. Dorigatti, G. Nedjati-Gilani, C. Donnelly, M. Basáñez, P. Nouvellet, J. Lessler (2016)
Countering the Zika epidemic in Latin America
Science, 353
E. Oehler, L. Watrin, P. Larre, I. Leparc-Goffart, S. Lastère, F. Valour, Laure Baudouin, H. Mallet, D. Musso, F. Ghawché (2014)
Zika virus infection complicated by Guillain-Barre syndrome--case report, French Polynesia, December 2013.
Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin, 19 9
Joseph Heled, A. Drummond (2013)
Calibrated Birth–Death Phylogenetic Time-Tree Priors for Bayesian Inference
Systematic Biology, 64
Nuno Faria, J. Quick, Ingra Morales, J. Thézé, J. Jesus, M. Giovanetti, M. Kraemer, S. Hill, Allison Black, A. Costa, L. Franco, S. Silva, Chiej-Hsi Wu, Jayna Ragwhani, S. Cauchemez, L. Plessis, M. Verotti, W. Oliveira, E. Carmo, G. Coelho, A. Santelli, L. Vinhal, C. Henriques, J. Simpson, M. Loose, K. Andersen, N. Grubaugh, S. Somasekar, C. Chiu, L. Lewis-Ximenez, S. Baylis, A. Chieppe, S. Aguiar, Carlos Fernandes, P. Lemos, B. Nascimento, H. Monteiro, I. Siqueira, Maria Queiroz, T. Souza, J. Bezerra, Marta Lemos, G. Pereira, D. Teixeira, L. Moura, R. Dhalia, R. Franca, T. Magalhaes, E. Marques, G. Wallau, M. Lima, V. Nascimento, E. Cerqueira, M. Lima, D. Mascarenhas, José Neto, A. Levin, T. Mendoza, S. Fonseca, M. Mendes-Corrêa, F. Milagres, A. Segurado, E. Holmes, A. Rambaut, T. Bedford, Marcio Teixeira, E. Sabino, L. Alcântara, N. Loman, O. Pybus (2017)
Establishment and cryptic transmission of Zika virus in Brazil and the Americas
Nature, 546
C. Ventura, M. Maia, Vasco Bravo-Filho, Adriana Gois, R. Belfort (2016)
Zika virus in Brazil and macular atrophy in a child with microcephaly
The Lancet, 387
Thu-Hien To, Matthieu Jung, S. Lycett, O. Gascuel (2015)
Fast Dating Using Least-Squares Criteria and Algorithms
Systematic Biology, 65
R. Team (2014)
R: A language and environment for statistical computing.
MSOR connections, 1
A. Drummond, A. Rambaut (2007)
BEAST: Bayesian evolutionary analysis by sampling trees
BMC Evolutionary Biology, 7
Christopher Dugaw (2019)
Birth–Death Models
L. Plessis, T. Stadler (2015)
Getting to the root of epidemic spread with phylodynamic analysis of genomic data.
Trends in microbiology, 23 7
L. Revell (2012)
phytools: an R package for phylogenetic comparative biology (and other things)
Methods in Ecology and Evolution, 3
J. Kingman (1982)
On the genealogy of large populations
Journal of Applied Probability, 19
E. Paradis, J. Claude, K. Strimmer (2004)
APE: Analyses of Phylogenetics and Evolution in R language
Bioinformatics, 20 2
(2015)
Epidemiological Alert: Neurological Syndrome, Congenital Malformations, and Zika Virus Infection'. Implications for Public Health in the Americas
V. Boskova, S. Bonhoeffer, T. Stadler (2014)
Inference of Epidemiological Dynamics Based on Simulated Phylogenies Using Birth-Death and Coalescent Models
PLoS Computational Biology, 10
B. Grenfell, O. Pybus, J. Gog, J. Wood, J. Daly, J. Mumford, E. Holmes (2004)
Unifying the Epidemiological and Evolutionary Dynamics of Pathogens
Science, 303
(2014)
Molecular evolution, phylogenetics and epidemiology. Edinburgh, UK: University of Edinburgh, Institute of Evolutionary Biology.
N. Faria, R. Azevedo, M. Kraemer, R. Souza, M. Cunha, S. Hill, J. Thézé, M. Bonsall, T. Bowden, I. Rissanen, I. Rocco, J. Nogueira, A. Maeda, F. Vasami, Fernando Macedo, A. Suzuki, S. Rodrigues, A. Cruz, B. Nunes, D. Medeiros, D. Rodrigues, Alice Queiroz, Eliana Silva, D. Henriques, E. Rosa, C. Oliveira, L. Martins, H. Vasconcelos, L. Casseb, D. Simith, J. Messina, L. Abade, J. Lourenço, L. Alcântara, M. Lima, M. Giovanetti, S. Hay, Rodrigo Oliveira, P. Lemos, L. Oliveira, C. Lima, S. Silva, J. Vasconcelos, L. Franco, J. Cardoso, J. Vianez-Júnior, D. Mir, G. Bello, E. Delatorre, K. Khan, Marisa Creatore, G. Coelho, Wanderson Oliveira, R. Tesh, O. Pybus, M. Nunes, P. Vasconcelos (2016)
Zika virus in the Americas: Early epidemiological and genetic findings
Science, 352
T. Stadler (2010)
Sampling-through-time in birth-death trees.
Journal of theoretical biology, 267 3
S. Duchêne, D. Duchêne, E. Holmes, S. Ho (2015)
The Performance of the Date-Randomization Test in Phylogenetic Analyses of Time-Structured Virus Data.
Molecular biology and evolution, 32 7
T. Stadler (2011)
Simulating trees with a fixed number of extant species.
Systematic biology, 60 5

Publisher: Oxford University Press
Copyright: © The Author(s) 2018. Published by Oxford University Press.
eISSN: 2057-1577
DOI: 10.1093/ve/vex044
Publisher site: See Article on Publisher Site

Abstract

Each new virus introduced into the human population could potentially spread and cause a worldwide epidemic. Thus, early quantiﬁcation of epidemic spread is crucial. Real-time sequencing followed by Bayesian phylodynamic analysis has proven to be extremely informative in this respect. Bayesian phylodynamic analyses require a model to be chosen and prior distributions on model parameters to be speciﬁed. We study here how choices regarding the tree prior inﬂuence quantiﬁca- tion of epidemic spread in an emerging epidemic by focusing on estimates of the parameters clock rate, tree height, and re- productive number in the currently ongoing Zika virus epidemic in the Americas. While parameter estimates are quite ro- bust to reasonable variations in the model settings when studying the complete data set, it is impossible to obtain unequivocal estimates when reducing the data to local Zika epidemics in Brazil and Florida, USA. Beyond the empirical in- sights, this study highlights the conceptual differences between the so-called birth–death and coalescent tree priors: while sequence sampling times alone can strongly inform the tree height and reproductive number under a birth–death model, the coalescent tree height prior is typically only slightly inﬂuenced by this information. Such conceptual differences to- gether with non-trivial interactions of different priors complicate proper interpretation of empirical results. Overall, our ﬁndings indicate that phylodynamic analyses of early viral spread data must be carried out with care as data sets may not necessarily be informative enough yet to provide estimates robust to prior settings. It is necessary to do a robustness check of these data sets by scanning several models and prior distributions. Only if the posterior distributions are robust to rea- sonable changes of the prior distribution, the parameter estimates can be trusted. Such robustness tests will help making real-time phylodynamic analyses of spreading epidemic more reliable in the future. Key words: tree height; substitution rate; tree prior; molecular epidemiology; start of epidemic. et al. 2014; Pan American Health Organization and World Health 1. Introduction Organization Regional Office for the Americas 2015; Ventura et al. In February 2016, the WHO declared the current Zika virus (ZIKV) 2016; Zika Situation Report 2016). Although the ZIKV public health epidemic ongoing in the Americas to be a public health emer- emergency ended in November 2016, the ZIKV epidemic remains gency of international concern (World Health Organization 2016). ‘a highly significant and long-term problem’ according to the This emergency level was reached because of a possible link be- WHO (World Health Organization 2016). tween ZIKV infection and an increased number of microcephaly Bayesian phylodynamic approaches have proven very help- in newborns as well as between ZIKV infection and other neuro- ful in quantifying the spread of the ZIKV outbreak (Faria et al. logical disorders such as the Guillian-Barre´syndrome (Oehler 2016, 2017; Grubaugh et al. 2017; Metsky et al. 2017). Such V C The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/ licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 2 | Virus Evolution, 2018, Vol. 4, No. 1 analyses revealed that the current epidemic in the Americas to sample from the posterior distribution P½tree; parjseq; data. most likely goes back to one Asian lineage and was introduced Typically, we assume that seq is independent of data such that between May and December 2013 (Faria et al. 2016) first in P½seqjtree; par; data¼ P½seqjtree; par. The distribution P½seqjtree; par Brazil, from where it spread to other countries in the Americas is referred to as the phylogenetic likelihood and is specified by an (Metsky et al. 2017). The latter hypothesis is confirmed by the evolutionary model, and the distribution P[tree, datajpar] is referred observation that the epidemic in Florida, USA, seems to have to as the phylodynamic likelihood and is specified by a population started from several introductions, most of these introduced dynamic model. In the application of phylodynamics to epidemiol- lineages clustering with Carribean ZIKV sequences, inter- ogy, the sequencing data comes from pathogens obtained from dif- spersed within the bigger cluster of Brazilian sequences ferent hosts, and the population dynamic model is an (Grubaugh et al. 2017). epidemiological model. The term P½par is the prior distribution on In an emerging epidemic, sequence data might be limited the parameters of the population dynamic and evolutionary and may not have yet accumulated enough information in order models. to unequivocally quantify the phylodynamics, i.e. identify the If the sequence data contains a strong phylogenetic signal, underlying phylogenetic tree and the evolutionary and epidemi- the inferred tree topology will be independent of the prior speci- ological parameters. In fact, the following two aspects play a fications, i.e. the tree topology will be purely determined by dif- role in the outcome of such analysis. First, the tree prior may be ferences in the sequences. In addition to the tree topology, we very important when inferring the phylogeny using a Bayesian are typically interested in the tree height, i.e. the age of the framework. In particular, the impact of treating sampling times most recent common ancestor (MRCA) of all samples, and in all as given a priori vs. as part of the data has not been investigated other branching times of the tree. The clock rate is the parame- in a thorough analysis. Second, the interplay of the tree height ter of the evolutionary model that specifies how many substitu- and the clock rate prior in the case of limited empirical data tions a sequence accumulates per calendar time unit. It thereby may impact the tree inference in an unpredictable fashion. This translates the branches in the tree from units of substitutions to effect too, has not been explored in detail so far (but see Mo ¨ ller units of calendar time. If all sequences are sampled at the same (2017) for a thorough simulation study). time point, we can only estimate the product of the tree height In the present work, we explore the relevance of these two and the clock rate. If the data is sampled through time, i.e. the points for the estimation of the tree height based on data from tips of the tree are at different time points, we can in principle the ZIKV outbreak in the Americas, and compare the results ob- tease the two parameters apart (Korber et al. 2000). However, if tained with the Bayesian phylogenetics tool BEAST 2 to simpler there is not enough temporal signal in the data, it might not be regression-based and least squares-based techniques for tree possible to separate the clock rate from the tree height robustly. height inference. The tree height can be used to estimate the Little temporal signal will result in high sensitivity of the tree start of the epidemic. A robust estimate of when the epidemic height (and all tree branch lengths) to prior assumptions on the started is in particular necessary to investigate whether the in- clock rate and the tree height. crease of microcephaly cases in Brazil coincides with the While we can specify a clock rate prior directly through any ongoing ZIKV epidemic. Furthermore, we discuss the appropri- probability distribution, we can only indirectly specify the tree ateness of interpreting the reproductive number—a key epide- height prior through the population dynamic model. In phylo- miological quantity describing the intensity of epidemic dynamics, each population dynamic model is described by a so- spread—which is co-estimated with the tree height using the called tree prior. There are two classes of tree priors that have so-called birth–death model. proven especially useful: the coalescent (Kingman 1982; For the remainder of this section, we provide some back- Drummond et al. 2005), and the birth–death process (Nee 2006; ground regarding phylodynamics, introduce common tree priors, Stadler 2010). In the following paragraphs, we describe their and discuss their conceptual differences. These insights will be main assumptions and point out why the tree height cannot be important for interpreting the results presented in this paper. directly controlled through setting of the model priors. The coalescent describes the dynamics of the population giving rise to the tree backward in time, i.e. going from the tips 1.1 Phylodynamics and tree priors to the root. It is parameterized by the effective population size. The sampling times of sequences are assumed to be given a pri- The main idea of phylodynamic approaches is to quantify popu- ori. Given such a priori sampling times samp and an effective lation dynamic parameters based on the phylogenetic tree ob- population size N as a parameter par, the coalescent induces a tained from sequencing data (Grenfell et al. 2004). In practice, distribution of phylogenetic trees and a distribution of tree when using a Bayesian framework, the phylogenetic tree is not heights, fixed either, but co-inferred with the population dynamic as well as the evolutionary parameters based on the sequencing data (Drummond and Rambaut 2007; du Plessis and Stadler P ½treejpar ¼ N : samp e 2015). More formally, the Bayesian phylodynamic approach aims at inferring the posterior distribution P½tree; parjseq; data Given the sampling times, we therefore indirectly control where par is the vector of all parameters of the population dy- the tree height via the effective population size. namic and evolutionary models, seq is the sequencing data, and Birth–death models describe the dynamics of a population data denotes other potential sources of data. We can re-write, forward in time. They are parameterized by the birth (trans- mission) rate k, the death (become uninfectious) rate d,the P½tree;parjseq;data¼ P½seqjtree;par;dataP½tree;datajparP½par=P½seq;data: sampling probability p, and thetimeof the start of thepopula- tion (outbreak; also called origin time) T. Note that the repro- The normalizing constant P½seq; data cannot be calculated di- ductive number, a key epidemiological quantity (Anderson rectly, and thus Markov chain Monte Carlo methods are employed and May 1991), can be directly extracted from these Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 3 parameters by simply dividing the birth rate by the death rate. 2. Methods The birth–death models give rise to a distribution of phyloge- 2.1 Data sets netic trees and in particular a distribution of sampling times and tree heights, On 3 November 2016 we gathered 252 full genome ZIKV consen- sus sequences. These consensus sequences stem from various P½tree; data ¼ sampjpar ¼ðk; d; p; TÞ: sources: we obtained 185 sequences from GenBank, 33 sequences from the Zibra project (https://github.com/zibraproject/zibrapro Thus, the upper limit of the tree height can be controlled, as it ject.github.io/tree/master/data/consensus), 17 sequences from the is strictly smaller than the origin time T, while the precise tree Andersen et al. github page (https://github.com/andersen-lab/ height is influenced by the birth, death and sampling parameters. zika-florida/tree/master/consensus_sequences), and 17 sequences Notice that while the coalescent conditions on the sampling from Ladner et al. github page (https://github.com/jtladner/ZIKA_ times when the tree is induced by the population size parame- Florida/tree/master/sequences). From this set, we removed ter, in the birth–death model the sampling times along with the sequences isolated before 2014, sequences isolated from other or- tree are induced by the birth–death parameters. This different ganisms than humans, sequences with missing date of isolation, conceptual usage of the sampling times samp under the two i.e. year and/or month information missing, sequences resulting approaches has important implications towards assessing the from vertical (mother to child) transmissions and sequences amount of information in the data regarding the tree height which appeared to be duplicates, i.e. sequences that seemed to with popular Bayesian phylodynamics softwares. A Bayesian have been isolated from the same individual on the same date phylogenetic analysis sampling from the posterior distribution but had different passaging histories or were sequenced by differ- of trees tree and parameters par using the coalescent conditions ent labs. This resulted in a set of 139 sequences. We refer to this on the sampling times samp: data set as the ‘ALL’ data set, because it includes sequences from different parts of the Americas. We also separately analysed a Coal P ½tree; parjseq/ P½seqjtree; parP ½treejparP½par; samp subset (sixty-seven sequences) of this full data set which only samp contained the sequences coming from Brazil (abbreviated as Coal BRAZIL). Additionally, we separately analysed a set of sequences where P ½treejpar is the probability density of a coalescent samp from Florida, USA (twenty-three sequences, abbreviated USA). tree tree given the sampling times samp. On the other hand, the These sequences formed a monophyletic cluster within the maxi- birth–death model uses the number of samples together with mum clade credibility (MCC) trees of the ‘ALL’ data set upon re- the associated sampling times samp and the sequence align- moval of one sequence from the Dominican Republic (Fig. 1 and ment seq as data: Supplementary Figs S1 and S2, see figure legend for a description of the model settings with which we obtained the MCC tree). For BD P½tree; parjseq; samp/ P½seqjtree; parP ½tree; sampjparP½par; the sake of simplicity, we refer to this cluster as monophyletic in the following. BD where P ½tree; sampjpar is the probability density of the sam- pling times samp together with the birth–death tree tree. Common practice for investigating the signal in the data is to run the Bayesian method ‘under the prior’ and then compare the 2.2 Phylogenetic and phylodynamic analyses results to those obtained from the full Bayesian analysis. In the strict sense, running an analysis ‘under the prior’ would mean We aligned the sequences using MUSCLE v3.8.31 (Edgar 2004). running the analysis when all data is ignored. With the coalescent We constructed trees with a maximum likelihood (ML) method approach, this means that sequencing data is ignored, while the PhyML v3.1 (Guindon et al. 2010) under the HKY substitution number and the time of sequence samples is used, as it is not model, estimating the equilibrium frequencies, and starting the considered to be a part of the data but rather a priori information. ML tree search using ten different random seeds. We assessed In contrast, with the birth–death model, the sequences, the num- the clock-like behaviour of the data using TempEst v1.5 ber of sequences as well as the sequence sampling times are ig- (Rambaut et al. 2016), by looking at the correlation of the sam- nored when running the Bayesian method ‘under the prior’. This pling times with the root-to-tip divergence, optimising the posi- stems from the fact that the birth–death model parameters in- tion of the root of the ML tree by maximising the correlation duce a distribution of the number of sequences as well as the se- metric. From the ML trees, we also estimated the time of the quence sampling times, meaning any information we have in MRCA in calendar time units using LSD v0.3 (To et al. 2015), us- that respect during the analysis is data. ing options -c (constrained mode, i.e. imposing that the node is Many users perform phylodynamic analyses with the soft- always older than any of its descendants) and -r as (searching ware package BEAST 2 (Bouckaert et al. 2014). The graphical for root node on all branches, using constraints). Finally, we user interface BEAUti allows the specification of model priors analysed the sequences using the Bayesian phylogenetic soft- and input data. It also allows the user to specify if ‘sampling ware tool BEAST v2.4.2 (Bouckaert et al. 2014). For all analyses, from prior’ should be performed. If this option is chosen, the se- the tip dates were fixed to the dates found in the sequence an- quencing data, but not the sampling times, are ignored. For the notation. For those sequences with the day-of-sampling infor- coalescent models this means that one obtains the prior distri- mation missing, we simply set the date to be year/month/15 butions. However, for the birth–death models this means that (total of 6/139 sequences). The results are consistent with those one obtains a posterior distribution of trees and parameters us- obtained from the analyses where the samples with incomplete ing sampling times data, while only the sequencing data is ig- sampling date information were excluded. nored. The analysis with the birth–death model when opting for For the Bayesian analyses, we need to specify two modelling ‘sampling from prior’ in BEAST 2 is therefore not equivalent to components, namely the sequence evolution model and the the analysis ‘under the prior’ in the usual Bayesian sense. tree prior model, which we specified as follows. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 4 | Virus Evolution, 2018, Vol. 4, No. 1 Figure 1. MCC tree of the 139 ZIKV sequences included in this study. The posterior clade support is displayed at each branching point. The names of all virus sequences that were isolated in Florida, USA, are highlighted in green and all sequences from Brazil are highlighted in magenta. The cluster of sequences highlighted in green rep- resents the twenty-three strains of the USA data set that form one monophyletic cluster upon exclusion of one non-USA sequence. The MCC tree was obtained using the BDSKY model in BEAST 2 with d ¼ 18.25, three intervals for R and a relaxed clock (see Supplementary Figs S1 and S2 for other model parametrizations). 2.2.1 Sequence evolution model settings 2.2.2 Tree prior model settings For all analyses we chose the HKY substitution model as the se- For the tree prior, we used the birth–death skyline model quence evolution model with substitution rate fixed to 1, LNð1; (Stadler et al. 2013) and the coalescent Bayesian skyline plot 1:25Þ prior and 0 as the lower limit for the j parameter, and esti- (Drummond et al. 2005). The birth–death skyline model is a mated base frequencies and C-distributed variation between birth–death model with constant parameters within intervals sites, using four categories, the shape parameter prior being set to through time while parameters may change across intervals. Exp(1). We allowed for between-branch rate variation using a re- We used one interval for the become uninfectious rate (i.e. the laxed clock model (Drummond et al. 2006). We used the log-nor- ‘death’ rate) and fixed its value to 14.26, 18.25, or 23.40 (corre- mal relaxed clock model with ucldMean prior being LNð0:001; 2Þ in sponding to the inverse of the lower 95-percentile, mean and real space. Furthermore, the ucldStdev prior was set to Cð0:5396; upper 95-percentile of the estimates of the mean ZIKV genera- 0:3819Þ with 0 as the lower limit. This resulted in an effective prior tion time obtained by Ferguson et al. (2016)). The become unin- on the clock model’s rate.mean to have mean of 0.001 and me- fectious rate is the inverse of the time period of being dian of 0.0001. This prior was chosen to reflect the estimates of infectious. All the values of the become uninfectious rate here previous studies, where the substitution rate was estimated to be are in units per year. Furthermore, we used five equidistant in- between 0:98 1:06 10 subst/site/year (note that this unit cor- tervals between the first and last sample for the sampling prob- responds to ‘substitutions per site and year’ or subst site - ability. For each interval we set the prior distribution to –1 year )(Faria et al. 2016) but was suspected to vary between Beta(1, 700). The sampling probability before the first sample –4 –3 2.610 and 4.410 subst/site/year depending on the host or- was set to 0. For the effective reproductive number, R (which is ganism (virological.org 2016). We also performed analyses using the ‘birth’ rate divided by the ‘death’ rate), we used three, four, a strict clock with prior LNð9:2; 2:3Þ for the rate parameter. Note or six equidistant intervals between the root and the last sam- that in the Bayesian framework, the product of the substitution ple with LN(0, 1) prior distribution and 0 as the lower and 50 as rate from the substitution model and the clock rate from the clock the upper bound. For the origin parameter, the prior was LNð5; model is the overall substitution rate of the process. Since we 0:5Þ in real space and lower and upper bounds of 0 and 15, re- fixed the substitution rate in the substitution model to 1, and we spectively. The coalescent Bayesian skyline model is based on a only estimate the clock rate, the clock rate we report effectively coalescent with constant effective population sizes within in- refers to the overall substitution rate. tervals through time while the effective population size may Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 5 change across intervals. In the coalescent Bayesian skyline plot (Kampstra 2008). Lastly, MCC trees were obtained using we set the prior for the effective population size to a Markov- TreeAnnotator v2.3.0 (Rambaut and Drummond 2015), removing chained distribution, i.e. the prior value of the parameter in the initial 10% of samples as burn-in, setting posterior probability interval follows a Gamma distribution with mean being set to limit to 0.1 and reconstructing node height based on Common the value in the previous interval. We set the prior distribution Ancestor heights criterium. The MCC trees were visualised us- of the population size in the first interval to Unifð0; 10 Þ. We set ing FigTree v1.4.0 (Rambaut 2012). The BEAST 2 source files and the number of intervals for the effective population size and analysis scripts can be found in the Supplementary Materials group size to 3, 4, or 6. and Methods zip file. In summary, for the three ZIKV data sets, we performed analyses under relaxed and strict clock models, using the coa- 3. Results lescent and birth–death models with a range of settings as sum- 3.1 Robustness analysis of the tree height and the marized in Table 1. We ran the chain for 10 steps both with and clock rate without sequencing data (sampling from prior). We sampled pa- 4 6 rameters every 10 steps and trees every 10 steps. The log files First, we set out to estimate the start of the epidemic and the were inspected in Tracer (Rambaut et al. 2014). All parameters clock rate of the ZIKV epidemic for (1) the complete data set in all the runs mixed well (ESS >200) with one exception, containing 139 ZIKV sequences sampled in the Americas (ALL), namely strict clock, all data, birth–death (BD): 4R LN(0, 1), (2) 67 sequences sampled only in Brazil (BRAZIL), and (3) 23 se- without sequences, in Supplementary Fig. S3B (tree height and quences belonging to the locally contained outbreak in Florida, origin had ESS of only 159 and 187, respectively). USA (USA). All data sets contain sequentially sampled se- For the analyses investigating the evolution of R over time quences and therefore should in principle allow for calibration in Florida, USA, we used the sequences of the Florida monophy- of the molecular clock (i.e. estimating the clock rate) and reli- letic cluster. We fixed the ucldMean parameter to l ¼ 9 10 able estimation of the tree height. We used three methods for subst/site/year, which was the mean and the median estimate estimation of the clock rate and the tree height: a regression of the clock rate (rate.mean parameter in BEAST 2) we obtained model correlating sampling time to the root-to-tip divergence when analysing the ALL data set without Florida, USA se- (PhyML combined with TempEst), a least squares-based ap- quences (see Supplementary Fig. S4, w/o Florida). Furthermore, proach (PhyML combined with LSD) and Bayesian phylody- in these analyses we varied the number of intervals for R to be namic approaches (BEAST 2). For the latter, we used two 3, 4, or 6 and the number of intervals for sampling probability p different models, the birth–death skyline model (Stadler et al. to be 1 or 5. 2013) and the coalescent Bayesian skyline plot (Drummond Post-processing of Bayesian analyses was done by first dis- et al. 2005). carding the initial 10% of the MCMC samples as burn-in. Results The start of the epidemic is parameterized in the BD model were then analysed in R (R Core Team 2013) using custom-made 0 as T, and the tree height T is the age of the MRCA. As in the scripts with the R packages boa (Smith 2007) for calculating the Bayesian coalescent (Coal) framework and the two non- highest posterior density (HPD) intervals, and RColorBrewer 0 Bayesian frameworks, only the tree height T rather than T is es- (Neuwirth 2014) for plotting. To obtain the lineages-through 0 timated, we compare the estimated tree height, T , under all time (LTT) plot, the R package ape (Paradis, Claude, and methods, using it as an approximation for the start of the epi- Strimmer 2004) was used in combination with the function demic, T. In what follows, we will refer to the ‘time of the MRCA’ LTT.plot.gen in the R package TreeSim (Stadler 2011). To extract (tMRCA) as the ‘date when the epidemic started’. This quantity the node (tree) height from the tree necessary for plotting the is derived by subtracting the tree height from the date of the LTT along with the R plots, we used R package phytools (Revell most recent tip in the data set. The rationale for comparing 2012). We used the R package beanplot for plotting the probabil- non-Bayesian to Bayesian estimates of the tree height (or of the ity densities of time of the MRCA and clock rate estimates date when the epidemic started) and the clock rate is to assess Table 1. Overview of models used in this study. Phylodynamic model Parameter Value Evolutionary Clock model model Birth–death skyline model with Effective reproductive Estimated; 3, 4, 6 intervals (3 R ,4 HKYþC Relaxed clock serial sampling (BD) number (R ) R ,6 R ) with LN prior with various e e e parameterizations Become uninfectious rate (d) Fixed; Strict clock d 2f14:26; 18:25; 23:40g Sampling probability (P) Estimated; with prior Beta(1, 700) Coalescent Bayesian skyline Effective population size (N ) Estimated; HKYþC Relaxed clock plot (Coal) 3, 4, 6 intervals (3 N ,4 N ,6 N ); with prior on e e e ﬁrst interval Unifð0; 10 Þ, following Strict clock intervals: C with mean population size of previous interval The abbreviations used in Figs 3 and 4 and Supplementary Figs S3 and S4 refer to this scheme. For example, BD: 3R LN(0, 1), d ¼ 18:25 refers to the birth–death skyline model with serial sampling which allows for three intervals for R , each with LN(0, 1) distributed prior, and the become uninfectious rate d ¼ 18.25. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 6 | Virus Evolution, 2018, Vol. 4, No. 1 the amount of information available in the data without adding The samples in the ALL and BRAZIL data sets range over more prior information. than 1.37 years whereas the sequences in the USA data set were only collected during less than four months. The x-intercept of the regression line between the calendar time (x-axis) and the 3.1.1 Regression analysis of the tree height and the clock rate root-to-tip divergence (y-axis) can be used as an approximation To obtain the estimate of the phylogenetic tree, we applied the of the date when the epidemic started. This way, the start of the ML method PhyML v3.1 (Guindon et al. 2010) to the sequences. epidemic in the Americas was estimated to be 12 June 2013. The We assessed the clock signal in the data using TempEst v1.5 Brazilian epidemic was estimated to start 1.6 years earlier, i.e. (Rambaut et al. 2016)(Fig. 2). All three data sets show a positive 30 October 2011. The epidemic in Florida was estimated to have correlation between genetic divergence (root-to-tip divergence) –3 started on 31 May 2016, i.e. three years after the tMRCA of the and sampling time, yielding estimates of clock rate of 1.57 10 –3 ALL data set (see Fig. 2). These results indicate that the clock sig- subst/site/year for the ALL data set, 1.76 10 subst/site/year –3 nal may not be strong enough in (some of) the data sets, as the for the BRAZIL data set, and 6.42 10 subst/site/year for the ALL data set cannot have a younger tMRCA than any of its sub- USA data set. The ALL data set exhibits the strongest associa- data sets. In addition, the estimates for the ALL and USA data tion between the genetic divergence and the sampling time 2 sets are very close to the earlier obtained estimates of the intro- (R ¼ 0:32). The BRAZIL and USA data sets show more diffuse 2 duction of ZIKV into the Americas; however, the estimate for 2 2 patterns with lower R -values: BRAZIL: R ¼ 0:19, USA: R ¼ 0:26. Figure 2. TempEst estimates of tree heights and clock rates. The tree height and the clock rate (i.e. slope) estimates obtained using TempEst for the three data sets: (A) the complete data set (ALL), (B) the sequences isolated in Brazil (BRAZIL), and (C) the (monophyletic cluster of) sequences isolated in Florida (USA). The tree for each data set was reconstructed using the ML method. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 7 the BRAZIL data set seems to be much older than estimated be- branches (in BEAST 2, this is the rate.mean parameter). Ignoring fore (Faria et al. 2016, 2017; Grubaugh et al. 2017). the sequences, we obtain the chosen prior distribution for the clock rate under both the BD and the Coal framework as ex- pected (gray distributions in Fig. 4). 3.1.2 Least squares analysis of the tree height and the clock rate The ML tree was also analysed with the least squares dating 3.1.3.2 How much additional information do the sequences contain tool LSD v0.3 (To et al. 2015), in order to estimate the tree height and the clock rate in the ZIKV epidemic. The inferred tree height concerning the tree height and the clock rate? When performing phylogenetic analyses for the ALL data set including the se- for the ALL data set resulted in the estimated date when the epi- demic started to be 8 January 2012. For the Brazilian data set, quence data, we obtain relatively consistent estimates of the median tree heights and clock rates in all models and model the start of the epidemic was estimated to be 28 July 2010. The oldest tree and thus the most ancient start of the epidemic, dat- specifications (thick green lines in Fig. 3 and thick red lines in Fig. 4, respectively). The median estimates of the tree heights ing to 18 October 1980 was estimated for the USA data set. The clock rate estimates for the ALL, BRAZIL, and USA data sets translate to the estimated start of the epidemic ranging from 25 –4 –4 –5 August 2013 to 22 December 2013. The range of median esti- were 5.21 10 , 5.62 10 , and 3.46 10 subst/site/year, re- 3 3 spectively. Again, these results indicate that the clock signal is mates of the clock rate is 1:01 10 1:24 10 subst/site/ year. weak in (some of) our data sets. In the BD analyses with various numbers of intervals for the effective reproductive number R , all 95% HPD intervals of the 3.1.3 Bayesian phylodynamic analysis of the tree height and the tree height include the tree height estimate obtained by clock rate TempEst (12 June 2013)—Fig. 3. This finding is robust also when In the next step, we used phylodynamic analysis to overcome we assume a slower become uninfectious rate, but not for other the problem of point estimates imposed by PhyML combined variations of priors for BD model parameters that we tested, i.e. with TempEst and LSD. In a nutshell, we used two different when assuming a faster become uninfectious rate or when set- models, the birth–death skyline model (Stadler et al. 2013) and ting a strong prior on very large or an unrealistically low R in the coalescent Bayesian skyline plot (Drummond et al. 2005). e the BD model (Supplementary Fig. S3A, BD: 3R LN(2, 0.2), The term ‘skyline’ refers to the parameters (birth, death, sam- e d ¼ 18.25 and BD: 3R LN(–2, 0.2), d ¼ 18.25). In the Coal analy- pling, or effective population size) changing in a piecewise con- e ses with various numbers of intervals for the effective popula- stant fashion through time. As laid out in the introduction, the tion size N , all 95% HPD intervals of the tree height include the two models differ conceptually in how data is used, and as a e tree height estimate obtained by TempEst (12 June 2013). This common practice, the models are run without sequence data as finding is robust for different priors on the effective population well as with all sequence data to determine how much informa- size (Supplementary Fig. S3A). Only an extremely small coales- tion the sequence data contains. Therefore, we will first explain cent population size priors in the Coal model (Supplementary the estimation of the tree height and the clock rate without se- Fig. S3A, Coal: 3 N LN(–2, 0.2)) produce HPDs not including quences and then include the sequence data. e the TempEst estimates. Since in the BD model the effective reproductive number, Neither the BD nor the Coal 95% HPD intervals of the clock the become uninfectious rate, and the sampling probability are –3 rate estimates contain the TempEst estimate of 1.57 10 simultaneously unidentifiable (Stadler et al. 2013; Boskova, subst/site/year (Fig. 4), although they are in the same order of Bonhoeffer, and Stadler 2014), we fixed the become uninfectious magnitude (10 subst/site/year). This is consistent for all mod- rate in all analyses. Ferguson et al. (2016) estimated the inverse els (Supplementary Fig. S4A). The median clock rates estimated of this rate, i.e. the mean ZIKV generation time with its lower in a Bayesian framework are consistently lower than the and upper 95-percentile. For our main analyses, we used their TempEst estimate (Fig. 4 and Supplementary Fig. S4A). Only a mean estimate translating to the become uninfectious rate of prior for extremely small coalescent population size pushes the 18.25, and for the Supplementary analyses we used the upper clock rate estimate to a higher rate (Supplementary Fig. S4A, 95-percentile and the lower 95-percentile of the Ferguson et al. Coal: 3 N LN(–2, 0.2)). estimates, i.e. we set the become uninfectious rate to d ¼ 14:25 e None of the HPD intervals contain the estimates of the tree and 23.40, respectively. height and the clock rate provided by LSD for any of the prior set- tings. The LSD clock rate estimates are consistently lower than 3.1.3.1 What can we learn about the tree height and the clock rate the BD and the Coal estimates. The LSD tree height estimates are when ignoring the sequencing data? When analysing the data sets always above the BD/Coal tree height estimates (see Figs 3 and 4, using the BD model with information on sampling times and and Supplementary Figs S3 and S4). number of samples only, i.e. ignoring the sequence data, we ob- In contrast to the ALL data set, the analysis of the BRAZIL tain very peaked tree height distributions (gray distributions in and USA data sets including the sequence data leads to distribu- Fig. 3). Thus, the number of samples and sampling times con- tions of the tree height and the clock rate that varied strongly tain a lot of information regarding the tree height. The medians amongst the models (Figs 3 and 4 and Supplementary Figs S3A of these distributions depend on different model specifications. and S4A). In particular, the 95% HPD intervals of the tree height When analysing the data sets using the Coal model ignoring the distributions for the BRAZIL and the USA data sets are very sequence data, i.e. running the analysis under the prior, the tree broad (Fig. 3). For the Coal model, the posterior distribution re- height distributions are very wide as expected since the sam- mains very flat (i.e. does not change much from the prior). pling times are conditioned upon under the coalescent (gray These results indicate that the BRAZIL and USA data sets gath- distributions in Fig. 3). ered in November 2016 do not contain a strong enough signal to We performed all analyses shown in Fig. 3 using a relaxed ultimately estimate the tree height reliably using phylodynamic molecular clock that allows for variation of the clock rates be- analysis. tween the different branches of the phylogenetic tree When analysing the BRAZIL and USA data sets separately, (Drummond et al. 2006). The clock rate for the relaxed clock we obtain very different estimates for the clock rate from the BD model is defined as the mean clock rate averaged over all Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 8 | Virus Evolution, 2018, Vol. 4, No. 1 Figure 3. The effect of addition of sequence data on the tMRCA estimates in the Bayesian analysis. The probability distribution of the estimates of the tMRCA resulting from the analysis with (green) and without (gray) sequences is shown. The ﬁgure shows estimates obtained under various model assumptions (labels on the x-axis summarize the models as explained in Table 1) and the three different data sets (header). The median date of MRCA is indicated with a thick solid line and the 95% HPD intervals are marked with thin solid lines. The black dashed, dotted, and dashed-dotted lines represent the tMRCA estimates based on the TempEst analysis (Fig. 2) for the ALL, BRAZIL, and USA data sets, respectively. The gray dashed, dotted, and dashed-dotted lines represent the tMRCA of the LSD estimates for the ALL, BRAZIL, and USA data sets, respectively. The become uninfectious rate in the BD model is set to d ¼ 18.25, which is the mean estimate in Ferguson et al. (2016). Notice that the median estimates of the tMRCA for the BRAZIL and USA data sets analysed with the Coal model are beyond the limits of the ﬁgure, so we state the estimated tree heights below. For the BRAZIL data set, the median tree height estimate of distributions resulting from analyses without sequence data for Coal: 3 N C is 6 6 5 1.8 10 years, for Coal: 4 N C is 1.0 10 years and for Coal: 6 N C is 3.5 10 years. The median tree height estimate of the distribution when sequence data is e e included for Coal: 3 N C is 1026.7 years, for Coal: 4 N C is 960.1 years, and for Coal: 6 N C is 1039.8 years. For the USA data set, the median tree height estimate e e e 6 6 of distributions resulting from analyses without sequence data for Coal: 3 N C is 2.0 10 years, for Coal: 4 N C is 1.2 10 years, and for Coal: 6 N C is e e e 5.3 10 years. The median tree height estimate of the distribution when sequence data is included for Coal: 3 N C is 460.8 years, for Coal: 4 N C is 449.1 years, e e and for Coal: 6 N C is 443.5 years. models and the Coal models (Fig. 4). In the BD models, the se- agreement with the TempEst analyses. It remains to be investi- quence data is able to bring the clock rate estimates closer to gated though why LSD produce outliers. However, the BRAZIL the TempEst estimates although the posterior distributions re- and the USA data sets do not contain enough information to main quite diffuse. We do not see a consistent pattern of the perform reliable estimation of tree height and clock rate param- clock rate estimates getting closer to the LSD estimates how- eters of these sub-epidemics. ever. Adding sequence data to the Coal models leads to clock Through the inclusion of the sequence data in the analysis, rate estimates that are even lower than expected from the prior the clock rate and the tree height priors are allowed to interact. (Fig. 4) and thus further away from the TempEst and LSD If the data contains enough signal, the clock rate and the tree estimates. height can be teased apart. However, if the data has a weak sig- In our main analyses we used the relaxed clock model. nal for calibrating the clock, then both the clock rate and the Further, we repeated the analyses with a strict clock model tree height estimates may be biased, and the nature of this bias (Supplementary Figs S3B,C and S4B,C). Employing the strict would be influenced by the interaction of the two prior clock models, we estimated the clock rates and tree heights to distributions. be very similar to those inferred by the relaxed clock models. If there is very little signal in the data for separation of the From our analyses we conclude that the signal contained in clock rate and the tree height, only their product can be estimated the ALL data set is sufficient to relatively consistently estimate reliably. The median of the product of the clock rate and the tree the tree height and the clock rate using Bayesian methods, if height for the BD analyses with three intervals for R for the ALL the priors on the tree height are not too strongly biased towards data set is 0.00291 (95% HPD¼(0.00250, 0.00335)), for BRAZIL extremely early or late introductions. Therefore, one can date 0.00270 (95% HPD¼(0.00192, 0.00347)) and for USA 0.00127 (95% the introduction of ZIKV into the Americas around late 2013. HPD¼(0.00081, 0.00170)). For the Coal analyses with three intervals The posterior distributions of the time of introduction are in for N , we obtain following median estimates of this product: ALL Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 9 Figure 4. The effect of addition of sequence data on the clock rate estimates in the Bayesian analysis. The probability distribution of the clock rate (rate.mean parame- ter) estimates resulting from the analysis with (red) and without (gray) sequences is shown. The ﬁgure shows estimates obtained under various model assumptions (la- bels on the x-axis summarize the models as explained in Table 1) and the three different data sets (header). The median clock rate is indicated with a thick solid line and the 95% HPD interval marked with thin solid lines. The black dashed, dotted, and dashed-dotted lines represent the clock rate estimates based on the TempEst analysis (Fig. 2) for the ALL, BRAZIL, and USA data sets, respectively. The gray dashed, dotted, and dashed-dotted lines represent the clock rate of the LSD estimates for the ALL, BRAZIL, and USA data sets, respectively. The become uninfectious rate in the BD model is set to d ¼ 18.25, which is the mean estimate in Ferguson et al. (2016). The clock rate displayed is in units of s/s/y, i.e. subst/site/year. 0.00362 (95% HPD¼(0.00299, 0.00441)), BRAZIL 0.00239 (95% particular data set was isolated on 10 July 2016, the start of the HPD¼(0.00157, 0.00360)), and USA 0.00117 (95% HPD¼(0.00079, epidemic would then be (2016:52 2:7 years) October 2013. 0.00163)). For both BD and Coal models, the ALL data set has the However, our clock rate prior has a median of the rate.mean pa- –4 highest product, followed by BRAZIL and finally USA. This order- rameter set to 1.3 10 subst/site/year, which, when combined ing makes sense: the full data set has an underlying tree as old as, with the estimated product of the clock rate and the tree height, 0:00270 or older than, the tree obtained from any subset of the full data would lead to tree height estimates of ¼ 20:77 years, i.e. 0:00013 set. Assuming all the data sets share the same clock rate, the the start of the epidemic would be October 1995. Thus, if the product of tree height and clock rate should be largest for the ALL sequencing data does not contain enough information to cali- –3 data set. Also, the 95% HPD intervals of this product overlap for brate the clock to the expected 1 10 subst/site/year, the in- the Coal and the BD analyses in all three data sets. ferred tree height will increase, i.e. the tMRCA will be pulled up Having in mind that we can estimate this product from our from April 2014 and beyond October 2013, upon inclusion of the data, we can now understand better the inconsistencies ob- sequence data. This insight can explain our large tree height tained when aiming to separate the clock rate and the tree (median estimate of 4.7 years, i.e. the start of the epidemic height using sequencing data. We illustrate the reason for the being October 2011), and the slow posterior clock rate (median –4 inconsistencies using the BRAZIL data set. The BD: 3 R model estimate of the rate.mean parameter being 5.8 10 subst/site/ produces a peaked estimate of the tree height when the tree year which is much slower than estimates by other studies prior is combined with the sampling times only (gray distribu- (Faria et al. 2016)). The situation for the other prior settings for tion in Fig. 3). The median estimate of 2.2 years (start of the BRAZIL and USA data set analysed with the BD model is epidemic April 2014) is close to the previously reported esti- analogous to what we have just described for the BRAZIL BD: mates of the start of the ZIKV epidemic in the Americas 3 R model. (Faria et al. 2016). So why does adding sequence data shift the When the BRAZIL data set is analysed under the Coal: 3 N estimated date of when the epidemic started further into the model, the median estimate of the product of the clock rate and past? The median estimate of the product of the clock rate the tree height is 0.00239 subst/site, which is very similar to the and the tree height is 0.00270 subst/site. If the true clock rate one obtained with the BD model. Again, if we assume the true 3 3 was 1 10 subst/site/year as estimated by Faria et al. (2016), clock rate to be 1 10 subst/site/year, as estimated before the tree height would be 2.7 years. Given the last sample in this (Faria et al. 2016), the tree height would be 2.4 years and the Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 10 | Virus Evolution, 2018, Vol. 4, No. 1 start of the epidemic would be estimated to be in February The number of secondary cases caused by a single infected indi- 2014. Our clock rate prior has a median of the rate.mean param- vidual at time t is referred to as the effective reproductive number, –4 eter set to 1.4 10 subst/site/year, leading to estimated me- R , at time t. Similarly to R , the effective reproductive number e 0 dian tree height of 17.07 years, i.e. a start of the epidemic being provides important information on whether the epidemic is in June 1999. However, unlike in the BD model, the tree height able to spread further—this is the case if R > 1, or whether it is prior is very diffuse under the coalescent. The median prior tree prone to die out—i.e. if R < 1. height is 1:8 10 years. So if the data is unable to provide cali- The effective reproductive number during an ongoing epi- bration information for the clock rate, the inferred tree height demic is directly estimated under the birth–death skyline model would increase a lot and the tMRCA will be pulled up from as the birth rate divided by the death rate at time t. One of the February 2014 due to the tree prior. Indeed, when the sequence assumptions of the birth–death skyline model is that the sam- data is included in the analysis, the median tree height estimate pling intensity may vary through time but it is distributed uni- turns out to be 1026 years. The tree height gets pulled down formly at random across the considered infected population at from what would be dictated by the tree prior. This estimate is any point in time. This assumption is violated for the ALL and however still higher than what would be expected under the BRAZIL data sets, since they encompass a big geographic area, clock rate prior (or the clock rate estimates in other studies), re- with sampling and sequencing efforts being higher in some re- quiring the clock to be even slower than defined by the prior gions than in others (Faria et al. 2017; Metsky et al. 2017). In con- (median estimate of the rate.mean parameter after inclusion of trast, the sequences in the Florida, USA data set were sampled sequences being 2:4 10 subst/site/year). The situation for across a smaller area formed by three neighbouring counties other prior settings for BRAZIL and also when using the Coal (Grubaugh et al. 2017) with the zone of transmission being esti- model with the Florida, USA data set is similar. mated within these counties for most individuals (only for three These findings show that the data sets of BRAZIL and USA of twenty-three sequences included in our data set had an un- sequences do not contain enough information to calibrate the known zone of transmission). In addition, the infected individ- molecular clock and thus fail to infer a proper tree height. uals in this region seemed to be sampled uniformly at random at any point in time. As we cannot estimate the clock rate for this data set, we set 3.2 Estimates of the effective reproductive number for it to the median (identical to mean) clock rate estimate we ob- the USA data set tained from analysing the full data set without the twenty-nine Florida sequences (i.e. the sequences highlighted in green in The number of secondary cases induced by one index case, i.e. the number of individuals infected by one infected individual Fig. 1 and Supplementary Figs S1 and S2). Thus, we set the clock –4 rate to 9 10 subst/site/year. We show the results in Fig. 5 us- during his/her infectious period, in a totally susceptible popula- tion is referred to as the basic reproductive number R .It is a ing six intervals for the birth rate (and thus R ) and assuming a 0 e constant sampling probability throughout the period of sample very important parameter to be estimated at the start of an epi- demic. If R < 1 the epidemic is prone to die out, if R > 1 the collection. This analysis reveals that for up to the time of the 0 0 last sample, i.e. until October 2016, the ZIKV spread fastest in disease will spread amongst the population (Anderson and May 1991). During an ongoing epidemic, some individuals recover Florida in mid-2016 and that the R dropped below 1 around August 2016. Allowing fewer intervals for the birth rate, mean- from the disease and gain immunity such that the population is not completely susceptible anymore. Further, the individuals ing only estimating a coarser pattern for R , yields R estimates e e which are averages from the finer six interval analysis, as ex- might change their behaviour and thus the number of second- ary cases caused by a single infected individual might change. pected (Supplementary Fig. S5). Figure 5. Birth–death skyline plot based on the USA data set. We used the BDSKY model allowing six intervals for R and 1 mean value for the sampling probability. Opaque coloring for the effective reproductive number, R (orange), and the sampling probability (cyan), depict the results of analyses without sequence data, the darker shades display estimates after the addition of sequence data. The vertical dashed magenta lines and crosses indicate the sampling time points of the viral se- quences. The magenta curve summarizes the lineages-through-time plot averaged over all MCMC samples without the burn-in (ﬁrst 10% of samples). For all parame- ters we show the 95% conﬁdence intervals and the median estimates (solid central line). Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 11 We were interested to compare our R estimates to the prev- case when the sampling periods are too short or when sampling alence and incidence data reported by Pan American Health is biased, i.e. when sampling is confounded with the genetic Organization (2017) and the Florida Department of Health structure in the population (Murray et al. 2016). Repeating the (2017). However, these statistics on local transmissions only be- analyses by randomly permuting the sampling dates of the se- gin on 18 August 2016 and 30 July 2016, respectively. Both sour- quences, i.e. doing a Bayesian permutation test (Ramsden et al. ces report a drastic increase in new cases in the first half of 2008), can identify the latter but not the former bias (Duche ˆne October 2016. Since in the Florida cluster we analysed here, the et al. 2015). In contrast, a version of the Bayesian permutation sequences are sampled as early as in June 2016, we observe an test, the so-called clustered permutation test, which random- increase in R before the first reports of local transmission from ises the date assignment between sequences belonging to a PAHO/Florida Department of Health. Also, our last sample was monophyletic cluster, rather than doing the assignment com- obtained on 11 October 2016 and we therefore do not capture pletely at random, can deal with both situations when assessing the potential peak in R beyond (nor potentially slightly before) the amount of clock signal in the data (Duche ˆ ne et al. 2015). this date. However, Grubaugh et al. (2017) provide an estimate In our ZIKV data analyses, the tree height estimates from of the R in the Miami Dade county, one of the three counties TempEst are largely in agreement with the Bayesian estimates the sequence data included in our analyses comes from. Based for the ALL data set. Using LSD method, we obtain conflicting on the observed local transmission data and the total number tree height estimates compared to the Bayesian and TempEst of introduction events (travel-associated cases), they estimated estimates already for the ALL data set, which may be due to LSD the R to be 0.5–0.8. In our analyses, once we see the epidemic needing stronger clock signal in the data to produce reliable spread taking off in early March 2016, we estimate median R of estimates. 0.88, quickly rising above 1 as the epidemic progresses. Between Both TempEst and LSD provide only point estimates for the end of May and mid-July 2016 we consistently observe R > 1: clock rate and the tMRCA parameters. To explore the robustness median R between 2.1 and 2.2 and 95% HPD¼(1.1, 3.1). Prior to of these estimates, and to obtain the confidence intervals, indi- March 2016 we do not see the epidemic spread and estimate rect approaches such as a bootstrap procedure has to be per- median R between 0.6 and 0.88 (95% HPD: 0.05–1.8), which formed. Furthermore, these estimates are based on a single would be in line with Grubaugh et al. estimates. The discrep- input tree. In case of limited data, the tree reconstruction ancy between our R estimates after March 2016 and the R esti- method and settings, e.g. evolutionary model, chosen can e 0 mates of Grubaugh et al. could stem from the estimates of strongly influence the results of these analyses (compare Fig. 2 Grubaugh et al. being based on the infected patient count data where the input three is reconstructed using ML procedure, and reported by the public health agencies, which could have poten- Supplementary Fig. S6, where the tree has been reconstructed tially missed a lot of mild or asymptomatic ZIKV infections. using another popular method, namely neighbour-joining algo- rithm (Saitou and Nei 1987). 4. Discussion 4.2 Analyses using Bayesian phylodynamic method As soon as a new virus starts to spread within the human popu- Bayesian methods are an alternative to the point estimate lation, it is important to estimate the potential impact of this methods. In the Bayesian framework, the uncertainty in the epidemic on a global scale. To obtain trustworthy parameter es- tree topology can be naturally integrated out by sampling over timates, the sequences must contain enough signal regarding many plausible topologies. In addition, the result of a Bayesian the quantities of interest. Testing the sequence data for a signal inference is a distribution of parameter values, rather than a on the molecular clock rate is therefore necessary. single point estimate, providing a measure of uncertainty in the estimate. Furthermore, results can easily be tested for robust- 4.1 Analyses using point estimate methods ness across models and across priors for parameters to investi- Point estimate methods allow for dating the tree and obtaining gate the amount of signal contained in the data about any an estimate for the molecular clock rate. Here, we explored such particular parameter. estimates using TempEst (Rambaut et al. 2016) and LSD tree dat- Bayesian phylodynamic analysis requires model choices and ing (Guindon et al. 2010; To et al. 2015). Tools such as TempEst decisions on prior settings before analysing the sampled se- (Rambaut et al. 2016) that visualise the regression of the root-to- quence data. Making these choices is not trivial. Despite the fact tip divergence with the calendar time can be helpful in deter- that prior distribution for in total four parameters, i.e. origin, be- mining the amount of clock signal in the data. There are only come uninfectious rate, effective reproductive number and the few rules on how exactly these tools should be used. The au- sampling probability, needs to be specified in the birth–death thors of the TempEst say that ‘[...] the estimation of a negative model, this task may be easier than specifying the prior distri- evolutionary rate indicates that the data set contains little or no bution on the single parameter, i.e. effective population size, of temporal signal’ (Rambaut et al. 2016). They also point out that the coalescent model. Epidemiological surveillance data often the amount of signal in the data can be assessed visually or provide information on the duration of the infection in the pa- through the correlation coefficient, R , provided by the method. tient and also on the sampling probability. Furthermore, assum- However, guidelines that describe how to exactly perform this ing R around 1 is reasonable for most infections. However, visual inspection, or what the reasonable cutoff for R is to de- there is usually no information on the effective infected popula- termine if the information in the data is sufficient to allow for tion size, which is in fact a different quantity than the infected the separation of the clock rate and the tree height signal are population size. lacking. Assessment of the clock signal in the data using this A good practice in Bayesian phylodynamics is to first run the tool is therefore very subjective. analysis without the sequence data and only add the sequence Such regression methods as well as a molecular clock selec- data in a second run. Such analyses can be very revealing about tion approach have been found to lead to wrong estimates of both the amount of information in the sequences and the as- the amount of clock signal in the data. This is especially the sumptions of the underlying models. Analysing just the Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 12 | Virus Evolution, 2018, Vol. 4, No. 1 sampling times of ZIKV data sets with the coalescent tree prior height distributions interact once the sequences are added. This led to wide distributions for the tree height parameter. When phenomenon has been also recently observed in simulations the sequence data was included, the tree height distributions (Mo ¨ ller 2017). peaked for the case of the ALL data set but remained very broad Only when we are guaranteed that the data contains enough for the BRAZIL and the USA sequences only, indicating a lack of signal for the clock calibration, we can proceed with interpret- signal in the two latter data sets. Even though we assumed a ing phylodynamic parameters. In case of limited data, it is pos- flat prior on the origin, the birth–death model analyses yielded sible to carry out the analyses by fixing one of the parameters: peaked estimates of the tree height already when just the infor- the clock rate or the tree height. Fixing the former is straightfor- mation on the sampling times was included. This is due to the ward as the clock rate is a parameter for which a prior is chosen, fact that the sampling process is part of the birth–death model however, fixing the latter is more complex as the tree height outcome. Thus, the sampling times already provide information prior distribution is induced by the prior distributions on the about the epidemiological dynamics. In this sense, the birth– model parameters. One may simply fix the MRCA node height, death models are very similar to classic epidemiology, where i.e. use a calibration node. However, superimposing of such the sampling times without sequences are used for estimation MRCA node age prior on top of the population dynamic prior of epidemiological parameters. can lead to unexpected actual prior distributions (Heled and If the birth–death model is an appropriate model regarding Drummond 2015). In addition to the technical issue with fixing the dynamics of the (sampling and epidemiological) process, the clock rate or the tree height parameters in the analyses, sampling times provide useful information regarding the pa- availability of reliable estimates also influences which parame- rameters of the process. However, if the birth–death model does ter one chooses to fix. Clock rate estimates are often available not describe the process correctly, a peaked distribution ob- for many data sets. Fixing this value and assuming the sam- tained based on sampling times can be misleading, especially if pling process is unbiased across the population at any point in this distribution is shifted away from the true parameter value. time, we can infer epidemiological parameters using the birth– When sequence data is combined with such shifted distribu- death (skyline) model. It has been shown before, that if the sam- tions, and the information in the sequences is not strong pling process is mis-specified in the birth–death model, e.g. if it enough to push the distribution to the correct values, wrong is fixed to a high value, while the true value is low, then other conclusions about important epidemiological parameters may epidemiological parameters will be biased too (see effect of mis- be drawn. specification due to parameter correlations in Boskova, Sampling from an actual prior distribution under the birth– Bonhoeffer, and Stadler (2014)). If the sampling process is biased death model involves sampling the number of samples and the in such a way that certain parts of the transmission tree are sampling times along with all other parameters. One option to sampled more than others, and one fails to capture those varia- obtain such a prior distribution under the birth–death model tions in the model both the birth–death and the coalescent would be to simulate trees under a given set of epidemiological model may produce biases as both of these models assume that parameters using, e.g. MASTER (Vaughan and Drummond 2013) at each point in time, a random sample is taken from the full or TreeSim package in R (Stadler 2011). If such sampling was process. carried out, one would obtain prior distributions on all parame- We performed a phylodynamic analysis on the Florida, USA ters together with prior distribution of the sample number and data set, by fixing the clock rate, and find evidence for the in- sampling times. In such a case, if the prior distribution on the crease in spread of the ZIKV in Florida in mid-2016 followed by origin parameter was specified to be uniform, sampling from the decline in spread after August 2016. We used the birth– the actual prior will produce a uniform prior distribution for the death skyline model with up to 6 intervals for R to analyse the origin parameter. However, the prior distribution for the tree USA data. The reason for not increasing this number further height parameter would be defined by the exact combination of was that we did not want to over-parametrise the model. In the birth rate, the death rate and the sampling probability particular, the data set only consisted of 23 sequences and in- parameter. cluding more parameters to estimate in the model would only lead to wider posterior intervals. An ideal alternative to the simple skyline model would be to use a smoothing prior, inform- 4.3 Analysis of limited sequence data using Bayesian ing each next interval by the results gathered from the previous phylodynamics one. In the future, we hope such prior will become available for the birth–death skyline model, allowing for more detailed Sequence data carry information for two distinct parameters: analyses of small data sets such as the ZIKV Florida cluster ex- the tree topology and the clock rate. If the sequences are not di- plored here. vergent enough from one another, the topology will be hard to estimate. If the sequences are sampled close to one another in time, relative to the substitution rate, even if the topology is cor- 5. Conclusions rectly resolved, it will be hard to tease apart the tree height and the clock (substitution) rate. The former problem is dealt with Based on our results presented here, we suggest to perform using the Bayesian analysis, since the tree topology can be Bayesian phylodynamic analyses during ongoing epidemics treated as a nuisance parameter and is integrated out. When se- with highest care with respect to model robustness. Instead of quences do not contain enough information to calibrate the mo- using just one set of priors to perform an analysis, it is essential lecular clock, only the product of the clock rate and the tree to check the robustness of the estimates obtained with the cho- height can be estimated reliably. Here, we have shown that sen models and different prior settings. If the data has enough once we attempt to decompose this product into the clock rate phylogenetic and clock signal, the posterior distributions will be and tree height estimates using insufficient data, severe prob- consistent over the range of models chosen. Only then, the pos- lems arise. Sequences with little information on the clock rate terior distributions can reliably reflect important parameters may shift the estimates in a wrong direction (compared to the such as the effective reproductive number, the start or the size estimates without sequences) as the clock rate and the tree of the epidemic. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 V. Boskova et al. | 13 Grubaugh, N. D. et al. (2017) ‘Genomic Epidemiology Reveals Acknowledgements Multiple Introductions of Zika Virus into the United States’, The authors express their gratitude to Louis du Plessis for Nature, 546: 401–05. helpful discussions and for making his skyline plotting Guindon, S. et al. (2010) ‘New Algorithms and Methods to scripts and TreeSlicer class from unreleased BEAST2 pack- Estimate Maximum-Likelihood Phylogenies: Assessing the age smoothingpriors (https://github.com/laduplessis/smooth Performance of Phyml 3.0’, Systematic Biology, 59: 307–21. ingpriors) available. V.B., T.S., and C.M. thank ETH Zu ¨ rich for Heled, J., and Drummond, A. J. (2015) ‘Calibrated Birth–Death funding. T.S. and V.B. were supported by the European Phylogenetic Time-Tree Priors for Bayesian Inference’, Research Council under the Seventh Framework Systematic Biology, 64: 369–83. Programme of the European Commission (PhyPD: grant Kampstra, P. (2008) ‘Beanplot: A Boxplot Alternative for Visual Comparison of Distributions’, Journal of Statistical Software, 28: agreement number 335529). 1–9. Kingman, J. F. (1982) ‘On the Genealogy of Large Populations’, Journal of Applied Probability, 19: 27–43. Authors’ contributions Korber, B. et al. (2000) ‘Timing the Ancestor of the HIV-1 Pandemic Strains’, Science, 288: 1789–96. V.B., T.S., and C.M. designed the study and the analyses. V.B. Metsky, H. C. et al. (2017) ‘Zika Virus Evolution and Spread in the performed the analyses. V.B., T.S., and C.M. wrote the paper. Americas’, Nature, 546: 411–15. Mo ¨ ller, S. (2016) ‘The Impact of the Tree Prior and Purifying Selection on Estimating Clock Rates during Viral Epidemics’, ETH Zurich. Supplementary data Murray, G. G. et al. (2016) ‘The Effect of Genetic Structure on Supplementary data are available at Virus Evolution online. Molecular Dating and Tests for Temporal Signal’, Methods in Ecology and Evolution, 7: 80–9. Nee, S. (2006) ‘Birth-Death Models in Macroevolution’, Annual References Review of Ecology, Evolution, and Systematics, 37: 1–17. Anderson, R. M., and May, R. M. (1991) Infectious Diseases of Neuwirth, E. (2014) Rcolorbrewer: Colorbrewer Palettes. R Humans: Dynamics and Control. Oxford: Oxford University Press. package version 1.1-2. Boskova, V., Bonhoeffer, S., and Stadler, T. (2014) ‘Inference of Oehler, E. et al. (2014) ‘Zika Virus Infection Complicated by Epidemiological Dynamics Based on Simulated Phylogenies Guillain-Barre Syndrome—Case Report, French Polynesia, December 2013’, Eurosurveillance, 19: pii: 20720. Using Birth-Death and Coalescent Models’, PLoS Computational Biology, 10: e1003913. Pan American Health Organization (PAHO) (2017) <http://www. Bouckaert, R. et al. (2014) ‘BEAST 2: A Software Platform for paho.org> accessed 9 June 2017. Bayesian Evolutionary Analysis’, PLoS Computational Biology, 10: Pan American Health Organization, and World Health Organization Regional Ofﬁce for the Americas (2015) e1003537. Drummond, A. J. et al. (2005) ‘Bayesian Coalescent Inference of ‘Epidemiological Alert: Neurological Syndrome, Congenital past Population Dynamics from Molecular Sequences’, Malformations, and Zika Virus Infection’. Implications for Public Molecular Biology and Evolution, 22: 1185–92. Health in the Americas. Technical report. <http://www.paho. org/hq/index.php?option¼com_docman&task¼doc_download et al. (2006) ‘Relaxed Phylogenetics and Dating with Conﬁdence’, PLoS Biology, 4: e88. &Itemid¼&gid¼32405&lang¼en>. Paradis, E., Claude, J., and Strimmer, K. (2004) ‘APE: Analyses of , and Rambaut, A. (2007) ‘BEAST: Bayesian Evolutionary Analysis by Sampling Trees’, BMC Evolutionary Biology, 7: 214. Phylogenetics and Evolution in R Language’, Bioinformatics du Plessis, L., and Stadler, T. (2015) ‘Getting to the Root of (Oxford, England), 20: 289–90. Epidemic Spread with Phylodynamic Analysis of Genomic R Core Team (2013) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Data’, Trends in Microbiology, 23: 383–6. Duche ˆ ne, S. et al. (2015) ‘The Performance of the Austria. <http://www.R-project.org/>. Date-Randomization Test in Phylogenetic Analyses of Rambaut, A. (2012) Figtree v1.4. Molecular evolution, phyloge- Time-Structured Virus Data’, Molecular Biology and Evolution, 32: netics and epidemiology. Edinburgh, UK: University of Edinburgh, Institute of Evolutionary Biology. <http://tree.bio. 1895–906. Edgar, R. C. (2004) ‘MUSCLE: Multiple Sequence Alignment with ed.ac.uk/software/ﬁgtree/>. High Accuracy and High Throughput’, Nucleic Acids Research, et al. (2014) Tracer v1.6, Molecular evolution, phylogenetics 32: 1792–7. and epidemiology, Edinburgh. UK: University of Edinbugh, Faria, N. R. et al. (2016) ‘Zika Virus in the Americas: Early Institute of Evolutionary Biology. <http://beast.bio.ed.ac.uk/ Epidemiological and Genetic Findings’, Science, 352: 345–9. tracer>. et al. (2016) ‘Exploring the Temporal Structure of et al. (2017) ‘Establishment and Cryptic Transmission of Zika Virus in Brazil and the Americas’, Nature, 546: 406–10. Heterochronous Sequences Using Tempest (Formerly Ferguson, N. M. et al. (2016) ‘Countering the Zika Epidemic in Path-o-Gen)’, Virus Evolution, 2: vew007. Latin America’, Science, 353: 353–4. , and Drummond, A. (2015) Treeannotator v2.3.0. Edinburgh, UK: University of Edinburgh, Institute of Florida Department of Health (2017) <http://www.ﬂoridahealth. gov/diseases-and-conditions/zika-virus/index.html> ac- Evolutionary Biology and Auckland, NZ: University of cessed 26 July 2017. Auckland, Department of Computer Science. Grenfell, B. T. et al. (2004) ‘Unifying the Epidemiological and Ramsden, C. et al. (2008) ‘High Rates of Molecular Evolution in Evolutionary Dynamics of Pathogens’, Science, 303: 327–32. Hantaviruses’, Molecular Biology and Evolution, 25: 1488–92. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018 14 | Virus Evolution, 2018, Vol. 4, No. 1 To, T. H., et al. (2015) ‘Fast Dating Using Least-Squares Criteria Revell, L. J. (2012) ‘Phytools: An R Package for Phylogenetic Comparative Biology (and Other Things)’, Methods in Ecology and Algorithms’, Systematic Biology, 65: 82–97. and Evolution, 3: 217–23. Vaughan, T. G., and Drummond, A. J. (2013) ‘A Stochastic Saitou, N., and Nei, M. (1987) ‘The Neighbor-Joining Method: A Simulator of Birth–Death Master Equations with New Method for Reconstructing Phylogenetic Trees’, Molecular Application to Phylodynamics’, Molecular Biology and Evolution, Biology and Evolution, 4: 406–25. 30: 1480–93. Smith, B. J. (2007) ‘boa: An R Package for MCMC Output Ventura, C. V. et al. (2016) ‘Zika Virus in Brazil and Macular Convergence Assessment and Posterior Inference’, Journal of Atrophy in a Child with Microcephaly’, The Lancet, 387: 228. Statistical Software, 21: 1–37. virological.org (2016). <http://virological.org> accessed 30 Nov Stadler, T. (2010) ‘Sampling-through-Time in Birth-Death Trees’, 2016. Journal of Theoretical Biology, 267: 396–404. World Health Organization (WHO) (2016) Geneva <http://www. (2011) ‘Simulating Trees with a Fixed Number of Extant who.int/emergencies/zika-virus/quarterly-update-october/en/ Species’, Systematic Biology, 60: 676–84. > accessed 1 Dec 2016. et al. (2013) ‘Birth–Death Skyline Plot Reveals Temporal —— (2016) ‘Zika Situation Report – Neurological syndrome and Changes of Epidemic Spread in HIV and Hepatitis C Virus congenital anomalies. Technical report’ <http://apps.who. (HCV)’, Proceedings of the National Academy of Sciences of the int/iris/bitstream/10665/204348/1/zikasitrep_5Feb2016_eng.pdf? United States of America, 110: 228–33. ua¼1>. Downloaded from https://academic.oup.com/ve/article-abstract/4/1/vex044/4829709 by Ed 'DeepDyve' Gillespie user on 21 March 2018

Journal

Virus Evolution – Oxford University Press

Published: Jan 1, 2018

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic

The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic

The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic

References (42)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies