TY - JOUR AU - Russell, Colin A AB - Abstract Subspecies nomenclature systems of pathogens are increasingly based on sequence data. The use of phylogenetics to identify and differentiate between clusters of genetically similar pathogens is particularly prevalent in virology from the nomenclature of human papillomaviruses to highly pathogenic avian influenza (HPAI) H5Nx viruses. These nomenclature systems rely on absolute genetic distance thresholds to define the maximum genetic divergence tolerated between viruses designated as closely related. However, the phylogenetic clustering methods used in these nomenclature systems are limited by the arbitrariness of setting intra and intercluster diversity thresholds. The lack of a consensus ground truth to define well-delineated, meaningful phylogenetic subpopulations amplifies the difficulties in identifying an informative distance threshold. Consequently, phylogenetic clustering often becomes an exploratory, ad hoc exercise. Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) was developed to provide a statistically principled phylogenetic clustering framework that negates the need for an arbitrarily defined distance threshold. Using the pairwise patristic distance distributions of an input phylogeny, PhyCLIP parameterizes the intra and intercluster divergence limits as statistical bounds in an integer linear programming model which is subsequently optimized to cluster as many sequences as possible. When applied to the hemagglutinin phylogeny of HPAI H5Nx viruses, PhyCLIP was not only able to recapitulate the current WHO/OIE/FAO H5 nomenclature system but also further delineated informative higher resolution clusters that capture geographically distinct subpopulations of viruses. PhyCLIP is pathogen-agnostic and can be generalized to a wide variety of research questions concerning the identification of biologically informative clusters in pathogen phylogenies. 
PhyCLIP is freely available at http://github.com/alvinxhan/PhyCLIP, last accessed March 15, 2019. phylogenetic clustering, molecular epidemiology, influenza, pathogen, nomenclature Introduction Advances in high-throughput sequencing technology and computational approaches in molecular epidemiology have seen sequence data increasingly integrated into clinical care, surveillance systems, and epidemiological studies (Gardy and Loman 2017). Based on the growing number of available pathogen sequences, genomic epidemiology has yielded a wealth of information on epidemiological and evolutionary questions ranging from transmission dynamics to genotype–phenotype correlations. Central to all of these questions is the need for robust and consistent nomenclature systems that describe and partition the genetic diversity of pathogens in ways that relate meaningfully to epidemiological, evolutionary, or ecological processes. Increasingly, nomenclature systems for pathogens below the species level are based on sequence information, supplementing or even displacing conventional biological properties such as serology or host range (Simmonds et al. 2010; McIntyre et al. 2013). However, existing sequence-based nomenclature frameworks for defining lineages, clades, or clusters in pathogen phylogenies are mostly based on arbitrary and inconsistent criteria. Standardizing the definition of a phylogenetic cluster or lineage across pathogens is complicated by differences in characteristics such as genome organization and maintenance ecology. Cluster definitions vary widely even between studies of the same pathogen, which limits generalization and interpretation between studies because designated clusters, clades, and/or lineages carry inconsistent information in the larger evolutionary context (Grabowski et al. 2018; Dennis et al. 2014; Hassan et al. 2017). 
In virology, nomenclature systems are largely reliant on absolute distance thresholds that define the maximum genetic divergence tolerated between viruses designated as closely related (Burk et al. 2011; Van Doorslaer et al. 2011; Lauber and Gorbalenya 2012; Donald et al. 2013; Kroneman et al. 2013; Poon et al. 2015, 2016; Smith et al. 2015; Valastro et al. 2016). Groups of closely related viruses are inferred to be phylogenetic clusters when the genetic distance between them is lower than the limit set on within-cluster divergence. Nonparametric distance-based clustering approaches have defined the distance between sequences using pairwise genetic distances calculated directly from sequence data (WHO/OIE/FAO H5N1 Evolution Working Group 2008; Aldous et al. 2012; Ragonnet-Cronin et al. 2013) or pairwise patristic distances calculated from inferred phylogenetic trees (Hué et al. 2004; Prosperi et al. 2011; Poon et al. 2015; Pu et al. 2015; Ortiz and Neuzil 2017). Within-cluster limits on tolerated divergence have been set using mean (WHO/OIE/FAO H5N1 Evolution Working Group 2008), median (Prosperi et al. 2011), or maximum within-cluster pairwise genetic or patristic distance (Ragonnet-Cronin et al. 2013). Some methods incorporate additional criteria, such as the statistical support for subtrees under consideration or minimum/maximum cluster size (Hué et al. 2004; Prosperi et al. 2010, 2011; Ragonnet-Cronin et al. 2013). These genetic distance-based clustering approaches are convenient, as they are rule-based and scalable, allowing for relatively easy nomenclature updates. Arguably, flexibility in the distance thresholds allows researchers to curate clusters based on consistency of the geographic or temporal metadata. The central limitation of approaches based on pairwise genetic or patristic distance is that thresholds to define meaningful within- and between-cluster diversity are arbitrary. 
For most pathogens, there is no clear definition of a well-delineated phylogenetic unit to underlie nomenclature designation or suggest what additional information would be informative to delineate subpopulations, for example, information on antigenicity or geography or host range. Resultantly, there is no ground truth to optimize distance thresholds when developing a nomenclature system for most pathogens. Partitioning phylogenetic trees into meaningful subsets is therefore complex and is mostly performed ad hoc through exploratory analyses with uninformative sensitivity analyses across thresholds. As expected, cluster membership is highly sensitive to the threshold applied and therefore results can be unstable across different cluster definitions (Rose et al. 2017). There is a need for a consistent, automated and robust statistical framework for determining cluster-defining criteria in nomenclature frameworks. Here, we describe a statistically principled phylogenetic clustering approach called Phylogenetic Clustering by Linear Integer Programming (PhyCLIP). PhyCLIP is based on integer linear programming (ILP) optimization, with the objective to assign statistically principled cluster membership to as many sequences as possible. We apply PhyCLIP to the hemagglutinin (HA) phylogeny of the highly pathogenic avian influenza (HPAI) A/goose/Guangdong/1/1996 (Gs/GD)-like lineage of the H5Nx subtype viruses, which underlies the most prominent nomenclature system for avian influenza viruses and which itself is based on a genetic distance approach (WHO/OIE/FAO H5N1 Evolution Working Group 2008). PhyCLIP is freely available on github (http://github.com/alvinxhan/PhyCLIP, last accessed March 15, 2019) and documentation can be found on the associated wiki page (http://github.com/alvinxhan/PhyCLIP/wiki, last accessed March 15, 2019). 
New Approach PhyCLIP requires an input phylogeny and three user-provided parameters: (1) the minimum number of sequences (S) that should be considered a cluster; (2) the multiple of deviations (γ) from the grand median of the mean pairwise sequence patristic distance that defines the within-cluster divergence limit (WCL); and (3) the false discovery rate (FDR) used to infer that the diversity observed for every combinatorial pair of output clusters is significantly distinct from one another. Figure 1A shows the workflow of PhyCLIP, which is further elaborated here. First, PhyCLIP considers the input phylogenetic tree as an ensemble of N monophyletic subtrees (including the root) that could potentially be clustered as a single phylogenetic cluster, each defined by an internal node i subtending a set of sequences Li (fig. 1B, see “Materials and Methods” section). Consequently, as the topological structure of the phylogenetic tree is incorporated in the cluster structure, it is possible to infer the evolutionary trajectory of the output clusters of PhyCLIP if the tree is appropriately rooted. For clarity, we use the term subtree to refer to the set of sequences subtended under the same node that could potentially be clustered and the term cluster to refer to sequences that are clustered by PhyCLIP within the same subtree.
Fig. 1. Schematics of PhyCLIP workflow and inference. (A) Workflow of PhyCLIP. Apart from an appropriately rooted phylogenetic tree, users only need to provide S, γ, and FDR as the inputs for PhyCLIP. After determining the WCL, PhyCLIP dissociates distantly related subtrees and outlying sequences that inflate the mean patristic distance (μi) of ancestral subtrees. The ILP model is then implemented and optimized to assign cluster membership to as many sequences as possible. If a prior of cluster membership is given, this is followed by a secondary optimization to retain as much of the prior membership as is statistically supportable within the limits of PhyCLIP. Post-ILP optimization clean-up steps are taken before yielding the finalized clustering output. (B) PhyCLIP considers the phylogeny as an ensemble of monophyletic subtrees, each defined by an internal node (circled numbers) subtending a set of sequences (letters encapsulated within the shaded region of the same color as the circled number). In this example, only subtrees with ≥3 sequences (S = 3) are considered for clustering by the ILP model, but WCL is determined from μi of all subtrees, including the unshaded subtrees 6–8. Only subtrees where μi ≤ WCL are eligible for clustering. (C) Subtrees o and q, as well as sequence j9, are dissociated from subtree i as they are exceedingly distant from i. If sequences j1, j2, j4, and j5 are clustered under subtree h whereas j3 is clustered under subtree i by ILP optimization, a post-ILP clean-up step will remove j3 from cluster i.
The within-cluster internal diversity of subtree i is measured by its mean pairwise sequence patristic distance (μi). PhyCLIP calculates the WCL, an upper bound to the internal diversity of a cluster, as: $\mathrm{WCL} = \bar{\mu} + \gamma\sigma$ (1), where $\bar{\mu}$ is the grand median of the mean pairwise patristic distance distribution {μ1, μ2, …, μi, …, μN} and σ is any robust estimator of scale (e.g., median absolute deviation, MAD, or Qn; see “Materials and Methods” section) that quantifies the statistical dispersion of the mean pairwise patristic distance distribution for the ensemble of N subtrees. In other words, only subtrees with μi ≤ WCL will be considered for clustering by PhyCLIP (fig. 1B). Distal Dissociation The assumption that a cluster must be monophyletic can lead to incorrect assignment of cluster membership to undersampled, distantly related outlying sequences that have diverged considerably from the rest of the cluster (e.g., sequence j9 in fig. 1C). These exceedingly distant outlying sequences can also inflate μi of the subtree they subtend, leading to inaccurate overestimation of the internal divergence of the putative subtree. 
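To make equation (1) concrete, the following is a minimal Python sketch and not PhyCLIP's own code. It assumes σ is the unscaled median absolute deviation (one of the estimators named above); the function names and the μi values are hypothetical and chosen only for illustration.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation: median of |x - median(x)|."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def within_cluster_limit(subtree_means, gamma):
    """WCL = grand median of the subtree means + gamma * robust scale (MAD here)."""
    return median(subtree_means) + gamma * mad(subtree_means)


# Hypothetical mean pairwise patristic distances (mu_i) for N = 7 subtrees.
mu = [0.010, 0.012, 0.015, 0.020, 0.022, 0.030, 0.080]

wcl = within_cluster_limit(mu, gamma=2.0)

# Only subtrees with mu_i <= WCL remain eligible for clustering.
eligible = [m for m in mu if m <= wcl]
```

With these made-up values the grand median is 0.020 and the MAD is 0.008, so the limit is 0.020 + 2 × 0.008 = 0.036 and only the subtree with μi = 0.080 is excluded. A scaled MAD or Qn would change the numbers but not the structure of the calculation.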
Similarly, distantly related descendant subtrees can artificially inflate μi of their ancestral trunk nodes (e.g., nodes o and q in fig. 1C). In turn, historical sequences immediately descending from a trunk node i will not be clustered if its μi exceeds WCL (fig. 1C). PhyCLIP dissociates any distal subtrees and/or outlying sequences from their ancestral lineage prior to implementing the ILP model. For any subtree i with μi > WCL, starting from the most distant sequence to i, PhyCLIP applies a leave-one-out strategy dissociating sequences, and the whole descendant subtree if every sequence subtended by it was dissociated, until the recalculated μi without the distantly related sequences falls below WCL. For each subtree, PhyCLIP also tests and dissociates any outlying sequences present. An outlying sequence is defined as any sequence whose patristic distance to the node in question is >3× the estimator of scale away from the median sequence patristic distance to the node. μi is recalculated for any node with changes to its sequence membership Li after dissociating these distantly related sequences. These distal dissociation steps effectively offer PhyCLIP greater flexibility in its clustering construct, allowing the identification of paraphyletic clusters on top of monophyletic ones that may better reflect the phylogenetic relationships of these sequences. Integer Linear Programming Optimization The full formulation of the ILP model is detailed in “Materials and Methods” section. Here, we broadly describe how the optimization algorithm proceeds to delineate the input phylogeny. The primary objective of PhyCLIP is to cluster as many sequences in the phylogeny as possible subject to the following constraints: All output clusters must contain ≥S sequences. All output clusters must satisfy μi ≤ WCL. 
The pairwise sequence patristic distance distribution of every combinatorial pair of output clusters must be significantly distinct from that of the cluster that would result if the sequences from the pair were combined. This is the intercluster divergence constraint and herein, statistical significance is inferred if the multiple-testing corrected P value for the cluster pair is <FDR. […] >25th percentile of PhyCLIP’s output cluster size distribution (i.e., for having proliferated in numbers substantial enough to be deemed a progeny cluster). Every child cluster of a supercluster is assigned a progeny number separated by a decimal point (e.g., 1.2 refers to the second child cluster of supercluster 1). However, descendant clusters that fall below the cluster size cut-off are distinguished from child clusters as nested clusters, each assigned an address in the form of a parenthesized letter, alphabetized by tree traversal order, prefixed by its parent supercluster nomenclature (e.g., 1.1[c] refers to the third nested cluster of supercluster 1.1). Nested clusters in superclusters fundamentally have different properties from the sensitivity-induced nested clusters discussed in “New Approach” section and cannot be subsumed, as doing so would violate the within-cluster limit of the parent supercluster. The structure of the resultant clustering topology is highlighted in figure 3. Phylogenetic Analyses PhyCLIP’s performance was evaluated on an empirical data set. The sequence data sets used to construct the HA gene phylogenetic trees underlying the WHO/OIE/FAO nomenclature for the A/goose/Guangdong/1/1996 (Gs/GD/96)-like H5 avian influenza viruses were downloaded from GISAID (WHO/OIE/FAO H5N1 Evolution Working Group 2008; WHO/OIE/FAO H5N1 Evolution Working Group 2012; WHO/OIE/FAO H5N1 Evolution Working Group 2014; Smith et al. 2015). The primary analysis is based on the full data set included in the 2009 (n = 1,224) and 2015 (n = 4,357) nomenclature updates. 
Viruses that were inconsistently included across WHO/OIE/FAO updates were followed up and included (WHO/OIE/FAO H5N1 Evolution Working Group 2009; Smith et al. 2015). Sequences were curated based on criteria defined by the H5 nomenclature: sequences with more than 5 ambiguous nucleotides, with a sequence length shorter than 60% of the alignment, with frameshifts, or duplicated by name were removed. For the 2018 phylogeny, all avian and human viruses from the Gs/GD-like H5 lineage were downloaded from GISAID up to April 2018, including H5Nx subtypes H5N2, H5N3, H5N5, H5N6, and H5N8. An alternative filtering approach to the published WHO nomenclature approach was applied to ensure a data set of high-quality sequences that would be robust to error in phylogenetic reconstruction, as PhyCLIP is inherently sensitive to topological information. In this approach, duplicate sequences and sequences with a length below 95% of the full HA sequence or with more than 1% ambiguous nucleotides were discarded. Sequences were aligned with MAFFT v7.397 and trimmed to the start of the mature protein (Katoh et al. 2002). Each sequence set was annotated with the WHO/OIE/FAO H5 nomenclature using LABEL (v0.5.2) and the version of the module corresponding to the nomenclature update of the data set (e.g., the H5v2015 module for the full tree from the 2015 nomenclature update) (Shepard et al. 2014). Maximum likelihood phylogenetic trees were constructed for each data set with RAxML 8.2.12 under the GTR+GAMMA substitution model and rooted to Gs/GD/96 (Stamatakis 2014). Phylogenetic trees were visualized using FigTree (http://tree.bio.ed.ac.uk/software/figtree/; last accessed March 15, 2019) and ggtree (Yu et al. 2017). Silhouette Index The silhouette index is based on the distance, here patristic distance, of each cluster member to other cluster members compared with the distance to its nearest neighbors (Rousseeuw 1987). 
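As a rough illustration of this definition, the sketch below computes per-sequence silhouette values from a pairwise distance matrix in Python. It is not the code used for the analysis; the function name, the toy distances, and the cluster labels are hypothetical, and at least two clusters are assumed.

```python
def silhouette(dist, labels):
    """Return the silhouette value s(i) of every item.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label of each item; at least two clusters are assumed
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention of Rousseeuw (1987) for singletons
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example: four tips, two tight clusters that are far apart.
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
labels = ["c1", "c1", "c2", "c2"]
scores = silhouette(dist, labels)
```

In this toy case the first two tips score (1.0 − 0.1)/1.0 = 0.9 and the last two (1.0 − 0.2)/1.0 = 0.8, i.e., all four sit much closer to their own cluster than to the neighboring one.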
Silhouette values approaching one indicate that the cluster member is correctly assigned, whereas values close to zero indicate that the sequence is equally matched to its neighboring cluster. A negative silhouette index indicates that the sequence is more closely related to the neighboring cluster than to its fellow cluster members. Calculation of the silhouette index was performed in R (R Core Team 2016). Code Availability PhyCLIP is freely available on github (http://github.com/alvinxhan/PhyCLIP; last accessed March 15, 2019) and documentation can be found on the associated wiki page (http://github.com/alvinxhan/PhyCLIP/wiki; last accessed March 15, 2019). Supplementary Material Supplementary data are available at Molecular Biology and Evolution online. Acknowledgments We thank the GISAID Initiative and the influenza surveillance and research groups that openly shared the genetic sequence data that made this work possible (full acknowledgement table is available as supplementary). A.X.H. was supported by the A*STAR Graduate Scholarship programme from A*STAR to carry out his PhD work via collaboration between Bioinformatics Institute (A*STAR) and NUS Graduate School for Integrative Sciences and Engineering from the National University of Singapore. E.P. was funded by the Gates Cambridge Trust (Grant number OPP1144). S.M.S. was supported by the A*STAR HEIDI programme (Grant number: H1699f0013) and Bioinformatics Institute (A*STAR). C.A.R. was supported by University Research Fellowship from the Royal Society. References Aldous JL, Pond SK, Poon A, Jain S, Qin H, Kahn JS, Kitahata M, Rodriguez B, Dennis AM, Boswell SL, et al. 2012. Characterizing HIV transmission networks across the United States. Clin Infect Dis. 55(8):1135–1143. Anisimova M, Gil M, Dufayard J-F, Dessimoz C, Gascuel O. 2011. 
Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. Syst Biol. 60(5):685–699. Boskova V, Stadler T, Magnus C. 2018. The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic. Virus Evol. 4:1–14. Burk RD, Chen Z, Harari A, Smith BC, Kocjan BJ, Maver PJ, Poljak M. 2011. Classification and nomenclature system for human Alphapapillomavirus variants: general features, nucleotide landmarks and assignment of HPV6 and HPV11 isolates to variant lineages. Acta Dermatovenerol Alp Pannonica Adriat. 20:113–123. Dennis AM, Herbeck JT, Brown AL, Kellam P, de Oliveira T, Pillay D, Fraser C, Cohen MS. 2014. Phylogenetic studies of transmission dynamics in generalized HIV epidemics: an essential tool where the burden is greatest? J Acquir Immune Defic Syndr. 67(2):181–195. Drummond AJ, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG. 2003. Measurably evolving populations. Trends Ecol Evol. 18(9):481–488. Duan L, Bahl J, Smith GJD, Wang J, Vijaykrishna D, Zhang LJ, Zhang JX, Li KS, Fan XH, Cheung CL, et al. 2008. The development and genetic diversity of H5N1 influenza virus in China, 1996–2006. Virology 380(2):243–254. Gardy JL, Loman NJ. 2017. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet. 19(1):9–20. Grabowski MK, Herbeck JT, Poon A. 2018. Genetic cluster analysis for HIV prevention. Curr HIV/AIDS Rep. 15(2):182–189. Hassan AS, Pybus OG, Sanders EJ, Albert J, Esbjörnsson J. 2017. 
Defining HIV-1 transmission clusters based on sequence data. AIDS 31(9):1211–1222. Hué S, Clewley JP, Cane PA, Pillay D. 2004. HIV-1 pol gene variation is sufficient for reconstruction of transmissions in the era of antiretroviral therapy. AIDS 18(5):719–728. Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066. Kroneman A, Vega E, Vennema H, Vinjé J, White PA, Hansman G, Green K, Martella V, Katayama K, Koopmans M. 2013. Proposal for a unified norovirus nomenclature and genotyping. Arch Virol. 158(10):2059–2068. Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K. 2012. Statistics and truth in phylogenomics. Mol Biol Evol. 29(2):457–472. Lauber C, Gorbalenya AE. 2012. Toward genetics-based virus taxonomy: comparative analysis of a genetics-based classification and the taxonomy of picornaviruses. J Virol. 86(7):3905–3915. McCloskey RM, Poon A. 2017. A model-based clustering method to detect infectious disease transmission outbreaks from sequence variation. Kosakovsky Pond SL, editor. PLoS Comput Biol. 13(11):e1005868. McIntyre CL, Knowles NJ, Simmonds P. 2013. Proposals for the classification of human rhinovirus species A, B and C into genotypically assigned types. J Gen Virol. 94(Pt 8):1791–1806. Meilă M. 2007. Comparing clusterings - an information based distance. J Multivar Anal. 98(5):873–895. 
Ortiz JR, Neuzil KM. 2017. Influenza immunization of pregnant women in resource-constrained countries: an update for funding and implementation decisions. Curr Opin Infect Dis. 30(5):455–462. Poon A. 2016. Impacts and shortcomings of genetic clustering methods for infectious disease outbreaks. Virus Evol. 2(2):vew031. Poon AFY, Gustafson R, Daly P, Zerr L, Demlow SE, Wong J, Woods CK, Hogg RS, Krajden M, Moore D, et al. 2016. Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. Lancet HIV 3(5):e231–238. Poon AFY, Joy JB, Woods CK, Shurgold S, Colley G, Brumme CJ, Hogg RS, Montaner JSG, Harrigan PR. 2015. The impact of clinical, demographic and risk factors on rates of HIV transmission: a population-based phylogenetic analysis in British Columbia, Canada. J Infect Dis. 211(6):926–935. Prosperi MCF, Ciccozzi M, Fanti I, Saladini F, Pecorari M, Borghi V, Di Giambenedetto S, Bruzzone B, Capetti A, Vivarelli A, et al. 2011. A novel methodology for large-scale phylogeny partition. Nat Commun. 2:321. Prosperi MCF, De Luca A, Di Giambenedetto S, Bracciale L, Fabbiani M, Cauda R, Salemi M. 2010. The threshold bootstrap clustering: a new approach to find families or transmission clusters within molecular quasispecies. Poon AFY, editor. PLoS One 5(10):e13619. Pu J, Wang S, Yin Y, Zhang G, Carter RA, Wang J, Xu G, Sun H, Wang M, Wen C, et al. 2015. Evolution of the H9N2 influenza genotype that facilitated the genesis of the novel H7N9 virus. Proc Natl Acad Sci U S A. 112(2):548–553. 
Ragonnet-Cronin M, Hodcroft E, Hué S, Fearnhill E, Delpech V, Brown AJ, Lycett S, Holmes E, Nee S, Rambaut A, et al. 2013. Automated analysis of phylogenetic clusters. BMC Bioinform. 14:317. Rambaut A, Lam TT, Max Carvalho L, Pybus OG. 2016. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol. 2(1):vew007. R Core Team. 2016. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available from: http://www.R-project.org/. Rose R, Lamers SL, Dollar JJ, Grabowski MK, Hodcroft EB, Ragonnet-Cronin M, Wertheim JO, Redd AD, German D, Laeyendecker O. 2017. Identifying transmission clusters with cluster picker and HIV-TRACE. AIDS Res Hum Retroviruses 33(3):211–218. Rousseeuw PJ. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 20:53–65. Rousseeuw PJ, Croux C. 1993. Alternatives to the median absolute deviation. J Am Stat Assoc. 88(424):1273–1283. Shepard SS, Davis CT, Bahl J, Rivailler P, York IA, Donis RO. 2014. LABEL: fast and accurate lineage assignment with assessment of H5N1 and H9N2 influenza A hemagglutinins. Woo PCY, editor. PLoS One 9(1):e86921. Simmonds P, McIntyre C, Savolainen-Kopra C, Tapparel C, Mackay IM, Hovi T. 2010. Proposals for the classification of human rhinovirus species C into genotypically assigned types. J Gen Virol. 91(Pt 10):2409–2419. Smith DB, Bukh J, Kuiken C, Muerhoff AS, Rice CM, Stapleton JT, Simmonds P. 2013. 
Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: updated criteria and assignment web resource. Hepatology 59:318–327. Smith GJD, Donis RO; World Health Organization/World Organisation for Animal Health/Food and Agriculture Organization (WHO/OIE/FAO) H5 Evolution Working Group. 2015. Nomenclature updates resulting from the evolution of avian influenza A(H5) virus clades 2.1.3.2a, 2.2.1, and 2.3.4 during 2013–2014. Influenza Other Respir Viruses 9(5):271–276. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313. Susko E. 2009. Bootstrap support is not first-order correct. Syst Biol. 58(2):211–223. The Global Consortium for H5N8 and Related Influenza Viruses. 2016. Role for migratory wild birds in the global spread of avian influenza H5N8. Science 354:213–217. Valastro V, Holmes EC, Britton P, Fusaro A, Jackwood MW, Cattoli G, Monne I. 2016. S1 gene-based phylogeny of infectious bronchitis virus: an attempt to harmonize virus classification. Infect Genet Evol. 39:349–364. Van Doorslaer K, Bernard H-U, Chen Z, de Villiers E-M, zur Hausen H, Burk RD. 2011. Papillomaviruses: evolution, Linnaean taxonomy and current nomenclature. Trends Microbiol. 19(2):49–50. Volz EM, Koopman JS, Ward MJ, Brown AL, Frost S. 2012. Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection. Fraser C, editor. PLoS Comput Biol. 8(6):e1002552. 
Wang J, Vijaykrishna D, Duan L, Bahl J, Zhang JX, Webster RG, Peiris JSM, Chen H, Smith GJD, Guan Y. 2008. Identification of the progenitors of Indonesian and Vietnamese avian influenza A (H5N1) viruses from Southern China. J Virol. 82(7):3405–3414. WHO/OIE/FAO H5N1 Evolution Working Group. 2008. Toward a unified nomenclature system for highly pathogenic avian influenza virus (H5N1). Emerg Infect Dis. 14(7):e1. WHO/OIE/FAO H5N1 Evolution Working Group. 2012. Continued evolution of highly pathogenic avian influenza A (H5N1): updated nomenclature. Influenza Other Respir Viruses 6:1–5. WHO/OIE/FAO H5N1 Evolution Working Group. 2009. Continuing progress towards a unified nomenclature for the highly pathogenic H5N1 avian influenza viruses: divergence of clade 2.2 viruses. Influenza Other Respir Viruses 3:59–62. World Health Organization/World Organisation for Animal Health/Food and Agriculture Organization (WHO/OIE/FAO) H5N1 Evolution Working Group. 2014. Revised and updated nomenclature for highly pathogenic avian influenza A (H5N1) viruses. Influenza Other Respir Viruses 8:384–388. Yu G, Smith DK, Zhu H, Guan Y, Lam T. 2017. Ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol. 8(1):28–36. Zharkikh A, Li WH. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol Biol Evol. 9(6):1119–1147. Author notes These authors contributed equally to this work. © The Author(s) 2019. 
Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. TI - Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) JF - Molecular Biology and Evolution DO - 10.1093/molbev/msz053 DA - 2019-07-01 UR - https://www.deepdyve.com/lp/oxford-university-press/phylogenetic-clustering-by-linear-integer-programming-phyclip-aDFjgxzAc1 SP - 1580 EP - 1595 VL - 36 IS - 7 DP - DeepDyve ER -