M. Thompson (2010)
Estimates of deaths associated with seasonal influenza - United States, 1976-2007. Morb. Mortal. Wkly Rep, 59
Derek Smith, S. Forrest, D. Ackley, A. Perelson (1999)
Variable efficacy of repeated annual influenza vaccination. Proceedings of the National Academy of Sciences of the United States of America, 96 24
C. Russell, T. Jones, I. Barr, N. Cox, R. Garten, V. Gregory, I. Gust, A. Hampson, A. Hay, A. Hurt, J. Jong, A. Kelso, A. Klimov, T. Kageyama, N. Komadina, A. Lapedes, Y. Lin, Ana Mosterín, M. Obuchi, T. Odagiri, A. Osterhaus, G. Rimmelzwaan, M. Shaw, E. Skepner, K. Stohr, M. Tashiro, R. Fouchier, Derek Smith (2008)
The Global Circulation of Seasonal Influenza A (H3N2) Viruses. Science, 320
A. Beck, M. Teboulle (2009)
A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sci., 2
W. Ampofo, N. Baylor, S. Cobey, N. Cox, Sharon Daves, S. Edwards, N. Ferguson, G. Grohmann, A. Hay, J. Katz, Kornnika Kullabutr, Linda Lambert, R. Levandowski, A. Mishra, A. Monto, M. Siqueira, M. Tashiro, Anthony Waddell, N. Wairagkar, J. Wood, M. Zambon, Wenqing Zhang (2013)
Improving influenza vaccine virus selection: report of a WHO informal consultation held at WHO headquarters, Geneva, Switzerland, 14–16 June 2010, 7
B. Shu, R. Garten, S. Emery, A. Balish, L. Cooper, W. Sessions, V. Deyde, Catherine Smith, L. Berman, A. Klimov, S. Lindstrom, Xiyan Xu (2012)
Genetic analysis and antigenic characterization of swine origin influenza viruses isolated from humans in the United States, 1990-2010. Virology, 422 1
Beck (2009)
SIAM J. Imaging Sci, 2, 183
M. Biggerstaff, C. Reed, S. Epperson, M. Jhung, M. Gambhir, J. Bresee, D. Jernigan, D. Swerdlow, L. Finelli (2013)
Estimates of the number of human infections with influenza A(H3N2) variant virus, United States, August 2011-April 2012. Clinical infectious diseases: an official publication of the Infectious Diseases Society of America, 57 Suppl 1
Bao (2008)
J. Virol, 82, 596
Derek Smith, A. Lapedes, J. Jong, T. Bestebroer, G. Rimmelzwaan, A. Osterhaus, R. Fouchier (2004)
Mapping the Antigenic and Genetic Evolution of Influenza Virus. Science, 305
R.F. Smith, T.F. Smith (1992)
Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Protein Eng. Des. Sel, 5
Zhipeng Cai, M. Ducatez, Jialiang Yang, Tong Zhang, Li-Ping Long, A. Boon, R. Webby, X. Wan (2012)
Identifying antigenicity-associated sites in highly pathogenic H5N1 influenza virus hemagglutinin by using sparse learning. Journal of molecular biology, 422 1
R.A. Neher (2016)
Prediction, dynamics, and visualization of antigenic phenotypes of seasonal influenza viruses. Proc. Natl. Acad. Sci. USA, 113
Zhipeng Cai, Tong Zhang, X. Wan (2010)
A Computational Framework for Influenza Antigenic Cartography. PLoS Computational Biology, 6
Randall Smith, Temple Smith (1992)
Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Protein engineering, 5 1
D.J. Smith (1999)
Variable efficacy of repeated annual influenza vaccination. Proc. Natl. Acad. Sci. USA, 96
Feng (2013)
J. Virol, 87, 7655
K. Mansfield (2007)
Viral tropism and the pathogenesis of influenza in the Mammalian host. The American journal of pathology, 171 4
William Harvey, D. Benton, V. Gregory, James Hall, R. Daniels, T. Bedford, D. Haydon, A. Hay, J. McCauley, R. Reeve (2016)
Identification of Low- and High-Impact Hemagglutinin Amino Acid Substitutions That Drive Antigenic Drift of Influenza A(H1N1) Viruses. PLoS Pathogens, 12
X. Chen, Qihang Lin, Seyoung Kim, J. Carbonell, E. Xing (2010)
Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6
A. Ng, Michael Jordan, Yair Weiss (2001)
On Spectral Clustering: Analysis and an algorithm
Mark Thompson, David Shay, Hong Zhou, C. Bridges, P. Cheng, Erin Burns, Joseph Bresee, N. Cox (2010)
Estimates of deaths associated with seasonal influenza - United States, 1976-2007. MMWR. Morbidity and mortality weekly report, 59 33
Yu-Chieh Liao, Min-Shi Lee, Chin-Yu Ko, C. Hsiung (2008)
Bioinformatics models for predicting antigenic variants of influenza A/H3N2 virus. Bioinformatics, 24 4
Cai (2012)
J. Mol. Biol, 422, 145
Jhang-Wei Huang, C. King, Jinn-Moon Yang (2009)
Co-evolution positions and rules for antigenic variants of human influenza A/H3N2 viruses. BMC Bioinformatics, 10
Yīmíng Bào, P. Bolotov, D. Dernovoy, B. Kiryutin, L. Zaslavsky, T. Tatusova, J. Ostell, D. Lipman (2007)
The Influenza Virus Resource at the National Center for Biotechnology Information. Journal of Virology, 82
Cai (2010)
PLoS Comput. Biol, 6, e1000949
Hailiang Sun, Jialiang Yang, Tong Zhang, Li-Ping Long, K. Jia, Guohua Yang, R. Webby, X. Wan (2013)
Using Sequence Data To Infer the Antigenicity of Influenza Virus. mBio, 4
X. Ren, Yuefeng Li, Xiaoning Liu, X. Shen, Wenlong Gao, Juansheng Li (2015)
Computational Identification of Antigenicity-Associated Sites in the Hemagglutinin Protein of A/H1N1 Seasonal Influenza Virus. PLoS ONE, 10
W. Ampofo, N. Baylor, S. Cobey, N. Cox, Sharon Daves, S. Edwards, N. Ferguson, G. Grohmann, A. Hay, J. Katz, Kornnika Kullabutr, Linda Lambert, R. Levandowski, A. Mishra, A. Monto, M. Siqueira, M. Tashiro, Anthony Waddell, N. Wairagkar, J. Wood, M. Zambon, Wenqing Zhang (2011)
Improving influenza vaccine virus selection: report of a WHO informal consultation held at WHO headquarters, Geneva, Switzerland, 14–16 June 2010. Influenza and Other Respiratory Viruses, 6
A.C.-C. Shih (2007)
Simultaneous amino acid substitutions at antigenic sites drive influenza A hemagglutinin evolution. Proc. Natl. Acad. Sci. USA, 104
Zhixin Feng, Janet Gomez, A. Bowman, Jianqiang Ye, Li-Ping Long, Sarah Nelson, Jialiang Yang, Brigitte Martin, K. Jia, J. Nolting, F. Cunningham, C. Cardona, Jianqiang Zhang, K. Yoon, R. Slemons, X. Wan (2013)
Antigenic Characterization of H3N2 Influenza A Viruses from Ohio Agricultural Fairs. Journal of Virology, 87
W. Thompson, D. Shay, Eric Weintraub, L. Brammer, C. Bridges, N. Cox, K. Fukuda (2004)
Influenza-associated hospitalizations in the United States. JAMA, 292 11
R. Squires, J. Noronha, Victoria Hunt, A. García-Sastre, C. Macken, N. Baumgarth, D. Suarez, B. Pickett, Yun Zhang, C. Larsen, Alvin Ramsey, Liwei Zhou, S. Zaremba, Sanjeev Kumar, Jon Deitrich, E. Klem, R. Scheuermann (2012)
Influenza Research Database: an integrated bioinformatics resource for influenza research and surveillance. Influenza and Other Respiratory Viruses, 6
Min-Shi Lee, J. Chen (2004)
Predicting Antigenic Variants of Influenza A/H3N2 Viruses. Emerging Infectious Diseases, 10
J.L. Barnett (2012)
Antigenmap 3d: an online antigenic cartography resource. Bioinformatics, 28
Biggerstaff (2013)
Clin. Infect. Dis, 57, S12
J. Stevens, Li-mei Chen, P. Carney, R. Garten, Angie Foust, Jianhua Le, B. Pokorny, R. Manojkumar, Jeanmarie Silverman, Rene Devis, K. Rhea, Xiyan Xu, D. Bucher, J. Paulson, N. Cox, A. Klimov, R. Donis (2010)
Receptor Specificity of Influenza A H3N2 Viruses Isolated in Mammalian Cells and Embryonated Chicken Eggs. Journal of Virology, 84
Chen (2012)
Ann. Appl. Stat, 6, 719
Barnett (2012)
Bioinformatics, 28, 1292
Y. Lin, V. Gregory, P. Collins, Johannes Kloess, S. Wharton, N. Cattle, A. Lackenby, R. Daniels, A. Hay (2010)
Neuraminidase Receptor Binding Variants of Human Influenza A(H3N2) Viruses Resulting from Substitution of Aspartic Acid 151 in the Catalytic Site: a Role in Virus Attachment? Journal of Virology, 84
Xiao-Tong Yuan, Tong Zhang, X. Wan (2013)
A Joint Matrix Completion and Filtering Model for Influenza Serological Data Integration. PLoS ONE, 8
A. Shih, Tzu-Chang Hsiao, M. Ho, Wen-Hsiung Li (2007)
Simultaneous amino acid substitutions at antigenic sites drive influenza A hemagglutinin evolution. Proceedings of the National Academy of Sciences, 104
J. Barnett, Jialiang Yang, Zhipeng Cai, Tong Zhang, X. Wan
Bioinformatics Applications Note Data and Text Mining Antigenmap 3d: an Online Antigenic Cartography Resource
Y. Shu, J. McCauley (2017)
GISAID: Global initiative on sharing all influenza data – from vision to reality. Eurosurveillance, 22
Ampofo (2012)
Influenza Other Respir. Viruses, 6, 142
R. Neher, T. Bedford, R. Daniels, C. Russell, B. Shraiman (2015)
Prediction, dynamics, and visualization of antigenic phenotypes of seasonal influenza viruses. Proceedings of the National Academy of Sciences, 113
Jialiang Yang, Tong Zhang, X. Wan (2014)
Sequence-Based Antigenic Change Prediction by a Sparse Learning Method Incorporating Co-Evolutionary Information. PLoS ONE, 9
N. Zhou, D. Senne, J. Landgraf, S. Swenson, G. Erickson, K. Rossow, Lin Liu, K. Yoon, S. Krauss, R. Webster (1999)
Genetic Reassortment of Avian, Swine, and Human Influenza A Viruses in American Pigs. Journal of Virology, 73
Motivation: Influenza virus antigenic variants continue to emerge and cause disease outbreaks. Time-consuming, costly and middle-throughput serologic methods using virus isolates are routine- ly used to identify influenza antigenic variants for vaccine strain selection. However, the resulting data are notoriously noisy and difficult to interpret and integrate because of variations in reagents, supplies and protocol implementation. A novel method without such limitations is needed for anti- genic variant identification. Results: We developed a Graph-Guided Multi-Task Sparse Learning (GG-MTSL) model that uses multi-sourced serologic data to learn antigenicity-associated mutations and infer antigenic var- iants. By applying GG-MTSL to influenza H3N2 hemagglutinin sequences, we showed the method enables rapid characterization of antigenic profiles and identification of antigenic variants in real time and on a large scale. Furthermore, sequences can be generated directly by using clinical sam- ples, thus minimizing biases due to culture-adapted mutation during virus isolation. Availability and implementation: MATLAB source codes developed for GG-MTSL are available through http://sysbio.cvm.msstate.edu/files/GG-MTSL/. Contact: wan@cvm.msstate.edu Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction Serologic assays, such as hemagglutination inhibition (HI) and Each year in the United States, influenza causes >200 000 hospital- neutralization inhibition assays, are routinely used during the influ- izations and 23 000 deaths, and many more hospitalizations and enza vaccine strain selection process to identify influenza antigenic deaths occur globally (Thompson et al., 2010, 2004). Vaccination is variants. However, these serologic assays are labor intensive, costly the primary strategy for reducing the impact of influenza outbreaks and middle-throughput, and they require the isolation of virus. (Harper et al., 1984). 
However, antigenic changes caused by anti- Thus, a genomic sequence-based strategy for antigenic variant iden- genic drift or shift at virus surface glycoproteins, especially hem- tification would be ideal because the genomic sequences can be agglutinin (HA), allow influenza viruses to evade the herd immunity obtained directly from clinical samples, which is efficient and eco- acquired by a population from prior infections or vaccination. The nomic. Such a method must be able to quantify antigenicity directly key to a successful influenza vaccination program is to select a using genomic sequences. That is, a quantitative function between vaccine candidate that antigenically matches the viruses that will be the virus genetic information (i.e. mutations in protein sequences) to circulating during the coming influenza season. the virus antigenic properties (i.e. changes in antigenicity) should be V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 77 78 L.Han et al. AB Fig. 1. Graph structure of the multi-task sparse learning model. (A) The tasks are first divided into three groups according to different data sources (i.e. HI datasets generated using turkey erythrocytes without neuraminidase inhibitor, guinea pig erythrocytes without neuraminidase inhibitor or guinea pig erythrocytes with neuraminidase inhibitor). Then, in each group, the tasks are formulated by sliding windows, denoted by circles. The edges indicate the information sharing among tasks with task similarity weight x .(B) A general graph structure of the multi-task learning concept i;j developed. In the past decade, there have been a few attempts on before HI assays to minimize the effects of neuraminidase-mediated these efforts. For instance, a simple statistical analysis of the correl- hemagglutination (Lin et al., 2010). 
Thus, integrating such data for ation between the HI titer and the count of mutations was used by machine learning is not a trivial task (Yuan et al., 2013). To the best (Lee and Chen, 2004); Regression and Bayesian models were intro- of our knowledge, none of the existing methods can perform duced by treating the mutations as features and the HI values or sequence-based antigenicity inference effectively with direct usage of antigenic similarities between sequences as responses (Harvey et al., these large amount of diverse data. In this study, we developed a 2016; Liao et al., 2008; Mansfield, 2007; Ren et al., 2015); more re- novel a Graph-Guided Multi-Task Sparse Learning (GG-MTSL) cently, sparse learning techniques (Cai et al., 2012; Neher et al., model to learn antigenicity-associated residues from multi-sourced 2016; Sun et al., 2013; Yang et al., 2014) have been proposed to re- serologic data. A quantitative model was developed to determine duce the affine relationship to sparse structure with concentration antigenic distances between any two viruses given their HA protein on a few key mutated residues. Nevertheless, these prior studies sequences, and this model was further applied to illustrate the anti- have demonstrated that only a small number of residues on influ- genic drift patterns of these human A(H3N2) influenza viruses. enza surface glycoproteins, especially antibody binding sites at HA, are associated with antigenic drift of influenza viruses (Smith et al., 2004; Sun et al., 2013), providing rationales for applying sparse 2 Materials and methods learning based methods in developing an effective sequence-based antigenic variant predictor using serologic data. 
2.1 GG-MTSL model One necessary condition for developing a robust sequence-based 2.1.1 Problem formulation antigenicity inference methods is the availability of large scale of The overall goal of this study was to develop a genomic sequence- both genomic and serologic data, which must be derived from influ- based antigenicity inference method. Our serologic datasets were enza viruses with much diversity on both genetic mutations and anti- composed of data generated by using three different protocols: tur- genic characteristics. Fortunately, a large set of serologic data for key erythrocytes with untreated viruses, guinea pig erythrocytes H3N2 influenza viruses have been generated during past decades, with untreated viruses, and guinea pig erythrocytes with neuramin- providing an opportunity to develop, validate and apply machine idase inhibitor-pretreated viruses. Viruses involved in these datasets learning in antigenic analyses. However, such serologic data are span a long time period and may not react with each other; the data- notoriously noisy and difficult to interpret because of inherent varia- sets included a large quantity of low reactors, had missing HI titers, tions in reagents, supplies and protocol implementation by labora- and had a unique distribution of data (Cai et al., 2010), presenting a tory personnel (Yuan et al., 2013) and because of human error challenge in matrix completion and determination of accurate dis- (Ampofo et al., 2012). In addition, over time, the protocols have tances for the low reactors. We formulated the problem of dealing been updated to minimize the effects of changing biologic attributes with different protocols and with viruses spanning a long time during virus evolution. For example, erythrocytes from various hosts period into a multi-task problem by separating the data into mul- were used to improve hemagglutination (Ampofo et al., 2012), and tiple temporal tasks (Fig. 
1); within each task, the HI data was gen- neuraminidase inhibitor was used to pretreat influenza viruses erated by using the same protocol and with a minimal number of …… …… …… Graph-guided multi-task sparse learning model 79 Multi -Source Serological HI matrices HA1 Sequences >BI/15793/1968 QDLPGNDNSTATLCLGHHAVPNGTLVKTITNDQIE >BI/16190/1968 QDLPGNDNSTATLCLGHHAVPNGTLVKTITDDQIE >BI/16398/1968 QDLPGNDNSTATLCLGHHAVPNGTLVKTITDDQIE >HK/1/1968 QDLPGNDNSTATLCLGHHAVPNGTLVKTITVDQIE >BI/808/1969 QDLPGNDNSTATLCLGHHAVPNGTLVKTITDDQIE A >BI/908/1969 QDLPGNDNSTATLCLGHHAVPNGTLVKTITNDQIE >BI/17938/1969 QDLPGNDNSTATLCLGHHAVPNGTLVKTITNDQIE >BI/93/1970 QDLPGNDNSTATLCLGHHAVPNGTLVKTITNDQIE >BI/2668/1970 QDLPGNDNSTATLCLGHHAVPNGTLVKTITNDQIE >BI/6449/1971 QDLPGNENSTATLCLGHHAVPNGTLVKTITNDQIE >BI/21438/1971 QDLPGNDKSTATLCLGHHAVPNGTLVKTITNDQIE Turkey Guinea Pig Inhibitor Task 1 Target Function The least square loss The graph-guided regularizer Task N Task N +1 1 1 Task Task N +N N +N +1 1 2 1 2 The sparse regularizer Task N +N +N 1 2 3 A/Hong Kong/1/1968 HK68 A/England/42/1972 EN72 A/Victoria/3/1975 EN72 HK68 VI75 A/Sichuan/2/1987 SI87 A/Beijing/352/1989 BE89 Linear prediction function A/Beijing/32/1992 VI75 Importance Score A/Wuhan/359/1995 BE92 SI87 A/Sydney/5/1997 WU95 A/Fujian/411/2002 SY97 Residue A/California/07/2004 FU02 BE92 A/Brisbane/10/2007 Task 1 Task 2 Task N CA04 BE89 A/Perth/16/2009 BR07 A/Texas/50/2012 A/Switzerland/9715293/2013 PE09 A/Hong Kong/4801/2014 TX12 WU95 SWZ13 145 0.1092 0.1075 HK14 SY97 0.0709 0.0867 FU02 40,000 Testing Sequences BR07 CA04 159 0.3456 HK14 TX12 PE09 SWZ13 CD Fig. 2. Workflow of the multi-task learning system. (A) The data processing integrates two types of data: sequence data (e.g. HA1 sequences shown in the left panel) and serologic data (e.g. HI data in the right panel). (B) Multiple tasks are formulated and integrated via a graph (Fig. 1). Specifically, in this study, the sero- logic data from multiple sources (e.g. 
data generated in a different time or using different protocols [i.e. HI datasets generated using turkey erythrocytes without neuraminidase inhibitor, guinea pig erythrocytes without neuraminidase inhibitor or guinea pig erythrocytes with neuraminidase inhibitor]) were separated into >50 individual tasks and processed by the multi-task matrix completion model. (C) Graph-based multi-task feature learning is conducted to identify and integrate influenza virus antigenicity-associated sites and their weights for each individual task. The finalized residues and associated weights are used to develop an en- semble prediction model to quantify antigenic distances given protein sequences. (D) Large-scale, sequence-based antigenic maps are constructed, and antigenic evolution of influenza viruses is studied by using data mining and machine learning (e.g. spectral clustering to antigenic drift events and Bayesian modeling to identify temporal and spatial origins for influenza antigenic variants) low reactors (Supplementary Fig. S1). Of note, the problem formula- and then, after evaluating the results of temporal task generation tion can be conveniently generalized to any multi-sourced dataset by obtained by using window sizes of 4, 6, 8, 12, 14 and 16 years, we splitting tasks according to protocols or specific settings, and then chose 12 years as the window size to generate temporal tasks. This within each source the tasks can be further decentralized along the window size is the same as that used in other studies, suggesting that temporal (Fig. 1A) or other dimensions (Fig. 1B). The key is that as a window size of 12 years achieved the best performance in minimiz- long as the connections inter- or intra-sources can be clearly repre- ing the effects of low-reactor viruses (Cai et al.,2010; Sun et al., sented as a general graph as the one shown in Figure 1, and then our 2013). 
A GG-MTSL method was then developed to identify key fea- learning framework can adopt any general graph structure to learn tures associated with viral antigenicity, and a quantitative function multiple tasks simultaneously. Following the protocol in the literature was developed to measure antigenic distances between influenza A (Cai et al.,2010), we sorted the viruses and serum samples by time viruses on the basis of their HA protein sequences. As shown in Prediction Training Multi-Task Formulation Data Processing 80 L.Han et al. Figure 2, multi-task learning consisted of three integrated steps: multi- the synergetic effects among multiple residues. The input data ma- N d task matrix completion; dynamic multi-source multi-task feature trix X 2 R for the i-th task contains the pairwise genetic distan- learning; and proposal of an ensemble antigenicity prediction model. ces for all the viruses in the corresponding window, where N is the N 1 number of pairs. The response y 2 R indicates the pairwise antigenic distance calculated from the HI matrix. Let G¼ðÞ V; E de- 2.1.2 Multi-task matrix completion note the graph, where V is the set of nodes and E indicates the edge Serologic data can typically be classified into three types of informa- set. If we encode each node as a task, then an edge in the graph tion: high-reactor data, low-reactor data or data with missing val- implies that the connected tasks are close to each other, and the ues. The assessment of these three types of data can be naturally weight on the edge indicates the strength of their similarity. Now, formulated as a low-rank matrix completion problem (Cai et al., the graph-based multi-task model aims to make the tasks connected 2010). In this study, we proposed a multi-task matrix completion by edges share similar parameters and it can be formulated as an op- method by separating matrix completion into multiple tasks. 
To timization problem: minimize the effects of the protocols on HI data for data integration, we ensured that the HI data in each individual task were generated by the same protocol. One major challenge in the multi-task matrix min jjy X w jj þ i i i dT TN W2R i¼1 completion method is that the optimal rank for each individual task 2 3 (2) may not be the same; it is not practical to optimize a universal rank 4 5 /a x jjw w jj þ ð1 aÞjjWjj ; for all tasks. To overcome this challenge, we adopted the nuclear i;j i j 2 1 ðÞ i;j 2E norm-based regularization technique to optimize ranks for each in- dividual tasks by penalizing the small eigenvalues in the matrix to be where (i, j) denotes an edge between the i-th task and j-th task; zeros (Han and Zhang, 2016; Jaggi et al., 2010). By solving the nu- d1 w 2 R is the model parameter of the i-th task; jj jj and jj jj 2 1 clear norm regularized problem, the optimal rank for each individ- indicate the ‘ and ‘ norms of vector and matrix, respectively; / 2 1 ual matrix completion task can be automatically identified. and a (0 a 1) are regularization parameters that control the Formally, given a m n sub-matrix A and the set of regular overall sparseness and the trade-off between the task similarity and entries and low reactors in A (denoted as a set X), the considered sparsity, respectively. 
In problem (2), the first term is the averaged matrix completion problem is to infer the missing values condition- square loss defined on the linear function mapping the genetic varia- ing on the regular entries and low reactors while completing these tions to the antigenic variations; the second term penalizes the ‘ low reactors with a more confident value, by solving the optimiza- norm of the difference between the parameters of any pair of con- tion problem: nected tasks, and the effect of this ‘ norm is to make the parameters m n XX from two tasks to be similar and hence share common patterns in X X 2 X min ðH A Þ IðH hÞþ kjjHjj ; (1) i;j i;j i;j the affine relationship; the ‘ term is employed to make the solution H 2 i¼1 j¼1 sparse and force the solution to select the important residues for each task, and this term is regardless of the graph structure. where the matrix H is the estimated completed matrix of A; H denotes i;j Solving problem (2) is not trivial because the second term in (2) the (i, j)-th element of H; H denotes the projection of H on the set X i;j i;j is non-smooth and general sub-gradient based optimization (i.e.ðÞ i; j 2 X); I is the indicator function; h is a predefined threshold for algorithms are inefficient. We proposed to employ the smoothing identifying the low-reactor value, where we set h ¼ log 20 and 20 is minðÞ m;n proximal gradient (SPG) method (Chen et al., 2012) to solve it. The the signal of the low-reactor value in the HI titers; jjHjj ¼ r i¼1 problem considered by the SPG method takes the form is the nuclear norm, which is the sum of all the singular values r sof H; i’ min fWðÞþ rZðÞ, where fðÞ is convex and Lipschitz continuous and k is a regularization parameter to trade off between data fitting and and rðÞ is convex but non-smooth. In order to employ the SPG the regularization of the matrix rank. 
In formulation (1), the first term method, we used fðÞ to represent the first term and rðÞ to represent is the least square loss defined on the regular values that are larger than the second term in (2). We could then rewrite rðÞ as h (by noting the indicator function), because these entries have true val- ues in A; the second term is the nuclear norm of H that forcing the small eigenvalues of the estimated matrix H to be zeros and thus H will have Algorithm 1 SPG algorithm for solving the GG-MTSL models. low rank with the rank detected automatically via this penalization. The algorithm for solving the nuclear norm-regularized matrix comple- ð0Þ Require: X, Y, l, x, k and W . tion problem in (1) is straightforward by following the existing Ensure: W. approaches (Jaggi et al., 2010; Yuan et al.,2013). The final HI matrix Initialize t ¼ 0 and s ¼ 1; is calculated by averaging the overlapped entries from multiple sub- repeat matrices from each learning task. ðtÞ ~ c Compute r f ðW Þ as in (6); Solve the proximal step: 2.1.3 GG-MTSL ðtÞ ðtÞ ðtÞ In this study, we assume the variations in serologic data with similar ðtþ1Þ ~ ~ c c c W ¼ arg min f ðW ÞþhW W ; r f ðW Þi temporal information would be determined with similar variations ðtÞ (i.e. genetic features, such as residues) in genetic data, regardless of c þ jjW W jj þ kjjWjj : (3) F 1 the sources of the serologic data. Thus, we can logically represent the relationships among individual tasks by using graphs based on s ¼ ; tþ1 tþ3 temporal orders (Fig. 1). Next, we explain how to apply the graph ðtþ1Þ ðtÞ ðtþ1Þ 1s ðtþ1Þ c t c W ¼ W þ s ðW W Þ; tþ1 structure to establish the multi-task learning framework. t ¼ t þ 1; Formally, let T be the number of tasks and d be the number of residues (the feature dimensionality). d can either be the sum of the until some convergence criterion is satisfied. 
number of residues and the number of co-mutations if we consider Graph-guided multi-task sparse learning model 81 ðÞ t rWðÞ¼jjCW jj þ kjjWjj ; ðÞ tþ1 ðÞ t 1;2 1 ~ c W ¼ H W r f W ; k W Em where k ¼ /ð1–aÞ and C 2 R (E ¼jEj is the number of edges) is where H ðÞ x ¼ signðÞ x maxðÞ x k; 0 is the soft-thresholding oper- a sparse matrix with each row containing only two non-zero entries, ator used in solving the Lasso problem (Beck and Teboulle, 2009). 1 and –1, in two corresponding positions, denoting an edge in the graph G. For example, when the graph is a chain, the matrix C is 2 3 2.1.4 Ensemble prediction model proposal x x 0 i;j i;j 6 7 After solving (2), we can obtain the coefficient vector w for each 6 7 0 x x i;j i;j 6 7 task i, indicating the importance of each residue in task i. Now, C ¼ /a : 6 7 6 7 0 0 given the sequences of a pair of viruses, i and j, we need a scoring 4 5 function to predict the antigenic distance between them. Suppose 0 x x i;j i;j virus i is from year a , then, we define our prediction model as Based on the definition of the dual norm, r(W) can be 1 l global local local reformulated as y ¼ x lw þ ðw þ w Þ (7) i j rWðÞ¼ maxhCW ; Aiþ kjjWjj ; (4) A2Q where x is the genetic distance vector based on the sequences; yb is the predicted antigenic distance between the two viruses; where a is a vector of auxiliary variables corresponding to the i-th i P w t > t t2A a > ðÞ global t¼1 local i row of CW ; A ¼ðÞ a ; .. . ; a is the auxiliary matrix variable, and 1 E w ¼ and w ¼ ; AðÞ a denotes the set of T i jAðÞ a j Q ¼fAjjja jj 1; 8ig is the domain of A. Then the smooth ap- tasks that the year a is covered by any task in AðÞ a ; jAðÞ a j denotes i i i proximation of the first term in (4) is given by the cardinality of AðÞ a ; and l is a parameter trading off between > the global coefficient and the local coefficient. 
Note that the pro- g ðÞ W ¼ maxhCW ; Ai ldAðÞ; (5) A2Q posed prediction model in (7) is an ensemble of two parts, the global coefficient and the local coefficient, under a trade-off parameter where hCW ; Ai is the inner product of the two matrices, 0 l 1. Because our GG-MTSL model in (7) captures the dis- dAðÞ¼ jjAjj , and jj jj is the matrix Frobenius norm. (5) is F F tinct antigenically associated residues with respect to each task, the convex and smooth with gradient rg ðÞ W ¼ A C, where A is local local coefficient, w , reveals the important residues in a certain the optimal solution to (5) (Han and Zhang, 2015). The computa- global local time period, while the global coefficient, w , captures the tion of A is depicted as follows (Chen et al., 2012; Han and Zhang, information of the important residues in the entire H3N2 influenza 2015): virus history. Proposition 1. By denoting by A ¼ a ; .. . ; a the optimal solution to (5), for any i, we have 2.1.5 Parameter tuning and performance evaluation ½ CW In the problem of (1), a regularization parameter k needs to be tuned a ¼ S ; l to obtain the best performance of the matrix completion. We chose k from a candidate set½ 0:1; 0:2; .. . ; 1:0 , which was found to be a reasonable range to effectively achieve low rank estimations. The > > where½ CW denotes the i-th row of the matrix CW , and S(x) is performance of different choices of k was evaluated under 10-fold the projection operator to project vector x on the ‘ ball as cross-validation, in which, during each fold, we randomly chose 90% of the known values (i.e. the high reactors) for training and use < ; jjxjj > 1; jjxjj S ¼ the remaining 10% of values as the testing set. We used the relative x; jjxjj 1: mean square error (ReMSE) for performance assessment Then, instead of directly solving (2), we solve its approximation H A i;j i;j ðÞ i;j 2S ReMSE ¼ P ; as ðÞ i;j 2S i;j minfWðÞþ kjjWjj ¼ fWðÞþ g ðÞ W þ kjjWjj : 1 1 W where S denotes the testing set of elements. 
The gradient of $\tilde{f}(W)$ with respect to $W$ can be computed as

\[ \nabla_W \tilde{f}(W) = \nabla_W f(W) + A^{*\top} C. \tag{6} \]

By using the square loss, the $i$-th column of $\nabla_W f(W)$ can be easily obtained as

\[ \frac{1}{TN} X_i^{\top} (X_i w_i - y_i). \]

Moreover, it is easy to prove that the gradient of $\tilde{f}(W)$ is $L$-Lipschitz continuous, where $L$ can be determined by numerical approaches (Chen et al., 2012). The SPG algorithm is depicted in Algorithm 1, where (3) has a closed-form solution.

In the problem of (2), for simplicity, we set $\omega_{i,j} = 1$ if task $i$ and task $j$ are from the neighbored windows; otherwise $\omega_{i,j} = 0$. Since different HI sources have overlapped time windows, the tasks from different data sources can also be connected in the graph. In addition to $\omega_{i,j}$, there are two regularization parameters, $\phi$ and $\alpha$, that need to be tuned to obtain the best performance of the multi-task feature learning. $\phi$ controls the overall sparseness of the solution, and $\alpha$ trades off between the task similarity and the element-wise sparsity. We proposed to choose $\phi$ from a candidate set $\{10^{-4}, 10^{-3}, \ldots, 10^{2}\}$, which is common practice in sparse learning methods (Han and Zhang, 2015, 2016; Han et al., 2016), and $\alpha$ from $[0.1, 0.2, \ldots, 0.9]$ (because $\alpha$ is a value in the interval $[0, 1]$) via 10-fold cross-validation. A larger $\phi$ will induce a sparser solution, and a larger $\alpha$ will lead to more similar tasks in the solution. We used the average rooted mean square error (RMSE) for performance assessment, which is defined as:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}. \]

2.2 Antigenic cartography and identification of antigenic clusters
The antigenic maps were constructed by using AntigenMap (Barnett et al., 2012; Cai et al., 2010), which is based on the antigenic distance matrix derived from serologic data or the GG-MTSL model described above. To identify the antigenic clusters in antigenic cartography, we used a spectral clustering method (Ng et al., 2002). Our spectral clustering method does not require prior knowledge of the number of clusters because the number can be determined through a nuclear norm regularization algorithm.

2.3 Genetic and antigenic distance generation
The sequence-based antigenic inference in our machine learning system explores genetic features that determine the connections between genetic and antigenic variations. The pairwise genetic distances were measured by using a binary coding function or a pattern-induced multi-sequence alignment (PIMA) scoring function (Smith and Smith, 1992). When considering the synergetic effects among multiple residues, the co-mutation features were represented by the product of the corresponding single genetic features. After we thoroughly compared the learning performances derived from these two scoring systems, we adopted the PIMA scoring system for conducting all analyses in this study.

The pairwise antigenic distances between viruses and antigens were derived from antigenic cartography (Cai et al., 2010). Each unit in the antigenic map corresponds to a $\log_2(\mathrm{HI})$; in antigenic maps, 2 units of antigenic distance represent a 4-fold change in HI titers, which, as described elsewhere (Smith et al., 1999), defines whether one virus is an antigenic variant of the other.

2.4 Sequence and serologic data
The HI data used in antigenic cartography and machine learning were collected from the literature (Smith et al., 2004; Sun et al., 2013) and from the annual reports for vaccine strain selection by the World Health Organization influenza collaborative centers, including data for 1528 viruses and 303 serum samples. HI data were obtained by using assays based on turkey erythrocytes (samples from 1968 to 2009), guinea pig erythrocytes (samples from 2006 to 2013) or guinea pig erythrocytes with neuraminidase inhibitor-pretreated viruses (samples from 2012 to 2016) (Supplementary Fig. S1). Because of the multiple variations in the way these multi-sourced serologic data were collected, they presented a challenge to data integration (Yuan et al., 2013) and, thus, provided a rationale for applying multi-task learning methods.

The full-length HA protein sequences for 39 370 human influenza A(H3N2) viruses collected during 1968–2016 were obtained from public databases (Supplementary Fig. S1). The sequence data were downloaded from public databases, including Influenza Virus Resource (Bao et al., 2008), Influenza Research Database (Squires et al., 2012) and GISAID (a global initiative on sharing all influenza data) (Shu and McCauley, 2017). Sequence and serologic data are available for download through http://sysbio.cvm.msstate.edu/files/GG-MTSL/.

2.5 Reverse genetics and serologic assays
The full-length cDNA for HA and neuraminidase genes of influenza A/Texas/50/2012(H3N2) virus were amplified by using SuperScript One-Step RT-PCR (Invitrogen, Grand Island, NY), and the 6:2 recombinant viruses with six internal genes of influenza A/PR/8/1934(H1N1) virus were generated by using reverse genetics. Site mutagenesis was performed using a QuikChange II Site-Directed Mutagenesis Kit (Stratagene, La Jolla, CA). Serologic assays were performed using 0.5% turkey erythrocytes.

3 Results
3.1 Graph-guided multi-task sparse learning model can predict the antigenic variant using sequence data
The aim of this study was to develop a genomic sequence-based antigenicity inference method and then to understand antigenic evolution of subtype H3N2 influenza A viruses by using large-scale antigenic profiles derived from this method. To achieve these goals, we proposed a novel GG-MTSL method and multi-source serologic data to identify key features associated with viral antigenicity. A quantitative function was then developed to measure antigenic distances between influenza A viruses on the basis of their HA protein sequences. We formulated the problem (i.e. dealing with different protocols and viruses spanning a long time period) into a multi-task problem by separating the data into multiple temporal tasks (Fig. 1). As shown in Figure 2, multi-task learning consisted of three integrated steps: multi-task matrix completion; dynamic multi-source, multi-task feature learning; and proposal of an ensemble antigenicity prediction model.

During multi-task matrix completion, we optimized $\lambda = 0.3$ by selecting the best average performance for each individual task (Supplementary Fig. S2A). To evaluate the overall performance of multi-task matrix completion, we used cross validation by randomly blinding 10% of the high reactors in the matrices for testing because there is no ground truth for the missing values and low reactors. Results showed that multi-task matrix completion achieved the best ReMSE (0.0413), which indicates only a 4.14% error rate of the true values in the original HI matrices. The training time of completing the matrices associated with all tasks was 1.2 min.

During multi-task feature learning, we optimized various parameters in the model by selecting the average performance for each individual task (Supplementary Fig. S2B–E). Results showed that $\phi = 10$ serves as a cutoff for the best trade-off between the number of selected residues and model performance. For $\alpha$, we observed that larger $\alpha$ always achieved lower average RMSE, implying that the multi-task sharing is important for boosting model performance. Hence, we set $\alpha = 0.9$. With these parameter settings, we finally achieved an average training RMSE of 0.78 (units) with around 24 selected residues per task. To evaluate the overall prediction performance of the GG-MTSL system procedure and a benchmark single-task learning method (Lasso model), we numerically validated our method by using historical HI data for training to predict future virus antigenicity given their sequences. Following a published protocol (Sun et al., 2013), we used the HI data from $[1996, k]$ for training and predicted the antigenic distance between any pair of viruses in the subsequent years $[k, k+1]$, where $k \in [2009, 2016]$. We reported the performance in terms of antigenic distance prediction errors and antigenic drift identification accuracy. Here antigenic distances >4-fold (2 units of antigenic distance) were treated as antigenic drift, and we used this value as the threshold to partition each pair of antigens into either non-variant or variant. Then, we could define classification tasks to measure the prediction accuracy.

We compared the GG-MTSL method with two other multi-task learning methods, the $\ell_{1,2}$ norm regularized MTL and $\ell_{1,1}$ norm regularized MTL (Liu et al., 2009), and two single task learning methods, the Lasso and Ridge regressions. Results showed that the average RMSE of the GG-MTSL system was 0.9154 (units) and its average accuracy for identifying antigenic variants was 85.55% for $k \in [2009, 2016]$. These results outperformed those of all four of the other methods compared (Supplementary Table S1). Moreover, the training time of the GG-MTSL on the entire feature learning task was 2.5 min, which was much faster than the other multi-task learning methods and slightly slower than the single-task methods, indicating that our optimization algorithm for solving the GG-MTSL model is efficient (Supplementary Table S1). These results demonstrated the effectiveness of the GG-MTSL model for inferring antigenicity.

For the ensemble prediction model, we need to optimize $\mu$, which leverages the ratio of the component of local coefficients (Supplementary Table S2) and global coefficients (Table 1). However, optimizing $\mu$ over the available HI data will lead to $\mu \to 0$ because we barely have the ground truth value of the antigenic distance between two viruses lying outside a window (due to the band-matrix shaped structure in which off-diagonal values are generally missing or due to low reactors). That is, the available testing data generally lie within the same task and will tend to emphasize the local weights, making them dominant. However, by evaluating $\mu$ over a candidate set $[0.1, 0.2, \ldots, 0.9]$, we clearly identified the one re-emergence event (i.e. some H3N2 variant (H3N2v)-like viruses were predicted to be antigenically similar to the A/Beijing/32/92(H3N2) (BE92) cluster, when $\mu = 0.2$). Hence, we set $\mu = 0.2$ in our study.

In summary, study results suggested that GG-MTSL could not only predict the antigenic variants, but could also achieve much better prediction performance than a single-task learning system by overcoming difficulties associated with integrating serologic data derived by using different protocols and obtained from multiple time periods and sources.

3.2 Mutations on HA protein drive antigenic drifts for H3N2 viruses in human
When we used optimized settings, each of 50 learning tasks in the model selected an average of 24 residues to be associated with antigenicity of H3N2 viruses, and a total of 66 unique residues were obtained from all 50 learning tasks (Table 1). These residues were mapped onto a 3D structure of the HA protein (Supplementary Fig. S3). Among these 66 residues, 59 were located in reported antibody binding sites A-E. To test the synergetic effects of multiple residues in the learning, by following (Yang et al., 2014), we also incorporated all the pairwise co-mutations among the residues located on the surface of the protein structure into the GG-MTSL procedure. After training, the GG-MTSL system identified 186 co-mutation pairs and the top-10 pairwise co-mutations were ⟨193, 196⟩; ⟨142, 196⟩; ⟨50, 196⟩; ⟨196, 225⟩; ⟨193, 225⟩; ⟨157, 189⟩; ⟨140, 196⟩; ⟨145, 173⟩; ⟨188, 196⟩ and ⟨189, 196⟩. Interestingly, four co-mutation pairs with non-zero weights: ⟨158, 189⟩; ⟨145, 225⟩; ⟨144, 225⟩ and ⟨145, 159⟩ were shown to have caused antigenic drifts of H3N2 viruses (Tables 1 and 2). Overall, the multi-task learning identified that mutations N145S-N225D-A138S-F159S and N145S-N225D-N144S-F159Y-Q311H drove the emergence of influenza viruses A/Switzerland/9715293/2013 (SWZ13) and A/Hong Kong/5738/2014 (HK14) from influenza virus A/Texas/50/2012 (TX12) (Table 2). These mutations are located in antibody binding sites A or B. It was probable that viruses with mutation N145S-N225D served as intermediate precursor viruses for SWZ13 and HK14, a suggestion supported by phylogenic analyses and antigenic cartography (Fig. 3A and B).

Table 1. Residue sites identified to be associated with influenza A(H3N2) virus antigenicity

Site  ABS†  w‡        Site  ABS†  w‡        Site  ABS†  w‡        Site  ABS†  w‡        Co-mutation   ABS†      w‡
25    –     0.0476    131   A     0.0126    174   D     0.0104    219   D     0.0075    ⟨193, 196⟩    ⟨B, B⟩    0.0042
31    –     0.0366    133   A     0.0206    186   B     0.0325    223   –     0.0494    ⟨142, 196⟩    ⟨A, B⟩    0.0025
45    C     0.0369    135   A     0.0209    188   B     0.023     225   –     0.0151    ⟨50, 196⟩     ⟨C, B⟩    0.0025
50    C     0.0278    137   A     0.0163    189   B     0.0511    226   D     0.0468    ⟨196, 225⟩    ⟨B, –⟩    0.0020
53    C     0.0878    138   A     0.0791    190   B     0.0097    230   D     0.0176    ⟨193, 225⟩    ⟨B, –⟩    0.0018
57    E     0.0529    140   A     0.0457    192   B     0.0262    242   D     0.0329    ⟨157, 189⟩    ⟨–, B⟩    0.0017
62    E     0.0431    142   A     0.0306    193   B     0.073     260   E     0.0332    ⟨140, 196⟩    ⟨A, B⟩    0.0016
67    E     0.0354    144   A     0.071     196   B     0.1357    262   E     0.0214    ⟨145, 173⟩    ⟨A, D⟩    0.0015
75    E     0.0057    145   A     0.0625    198   B     0.0294    275   C     0.0083    ⟨188, 196⟩    ⟨B, B⟩    0.0015
78    E     0.0123    155   B     0.0092    199   –     0.0576    276   C     0.0252    ⟨189, 196⟩    ⟨B, B⟩    0.0015
82    E     0.0159    156   B     0.0741    201   D     0.006     278   C     0.0635    ⟨158, 189⟩    ⟨B, B⟩    0.0009
83    E     0.0347    158   B     0.1373    202   –     0.026     280   C     0.0152    ⟨145, 225⟩    ⟨A, –⟩    0.0007
88    E     0.053     159   B     0.0335    207   D     0.0193    299   C     0.0098    ⟨144, 225⟩    ⟨A, –⟩    0.0006
112   –     0.0169    160   B     0.0211    212   D     0.0063    311   C     0.0085    ⟨145, 159⟩    ⟨A, B⟩    0.0006
121   D     0.018     163   B     0.0301    213   D     0.0148    312   C     0.0122
122   A     0.0275    172   D     0.0045    214   D     0.0382
124   A     0.032     173   D     0.102     217   D     0.0105

Note: †ABS, antibody binding site; '–', residue that is not in ABS; ‡66 selected residues with global coefficient, $w^{\mathrm{global}}$, are included in the table, and the local coefficients, $w^{\mathrm{local}}$, for different tasks are provided in Supplementary Table S2.

Table 2. Antigenic drift events for seasonal influenza A(H3N2) viruses, in order of occurrence (2007–2016), and the residues determining the drift events

Antigenic drift event    Predominant mutations
BR07 → PE09              K158N-N189K
PE09 → TX12              N278K-S45N
TX12 → SWZ13             N145S-N225D-A138S-F159S
TX12 → HK14              N145S-N225D-N144S-F159Y-Q311H

Note: BR07, A/Brisbane/59/2007; HK14, A/Hong Kong/4801/2014; PE09, A/Perth/16/2009; SWZ13, A/Switzerland/9715293/2013; TX12, A/Texas/50/2012.

To confirm this hypothesis, we used the HA and neuraminidase genes of TX12 as template to generate six mutant viruses (bold letters indicate mutations against TX12-like viruses): 145N-225N-138A-144N-159F-311Q (TX12-like), 145S-225D-138A-144N-159F-311Q (intermediate-like), 145S-225D-138S-144N-159S-311Q (SWZ13-like), 145S-225D-138A-144S-159Y-311H (HK14-like), 145N-225N-138S-144N-159S-311Q (TX12-like) and 145N-225N-138A-144N-159Y-311H (TX12-like). Serologic testing showed that these reassortants antigenically matched our predicted results (Table 3 and Fig. 3D). Specifically, results showed that mutant 145S-225D-138A-144N-159F-311Q is located in the center position among TX12, SWZ13 and HK14 viruses; that 145S-225D-138S-144N-159S-311Q is close to SWZ13 virus; and that 145S-225D-138A-144S-159Y-311H is close to HK14-like virus. Such results indicated that mutations N145S-N225D-A138S-F159S caused antigenic drift from TX12 to SWZ13 and that mutations N145S-N225D-N144S-F159Y-Q311H caused antigenic drift from TX12 to HK14 (Table 2). Furthermore, the antigenic variants of SWZ13 and HK14 were derived from the same intermediate variant bearing residues 145S-225D-138A-144N-159F-311Q in their HA protein sequences.

In addition, we used our scoring function from the GG-MTSL ensemble model to predict the pairwise antigenic distance of the mutant viruses. The predicted antigenic distances had a correlation coefficient of 0.75 compared with the HI assay-based antigenic distances, which suggested a high correspondence between real antigenic distances (HI-based) and predicted antigenic distances (sequence-based). Machine learning results suggested that 1–5 mutations led to the antigenic changes in the four antigenic drift events since 2007.

Fig. 3. Co-circulation of two influenza A H3N2 virus antigenic variants, SWZ13-like and HK14-like viruses. (A) Phylogenic analyses demonstrating genetic diversity of H3N2 viruses during the 2015–2016 influenza season. Shaded samples represent viruses that emerged in 2015–2016, implying that the two clades were still co-circulating as of 2016. (B) Antigenic map demonstrating that SWZ13-like and HK14-like viruses are co-circulating along two different directions. Some of the viruses that emerged in 2015–2016 are labeled. Markers with white face and black edge indicate the estimated centers for each antigenic cluster. (C) Estimated mutations leading to antigenic drift events TX12 → SWZ13 and TX12 → HK14. An intermediate mutation (i.e. a double-mutation from TX12 vaccine strain) was identified. (D) Bench hemagglutination inhibition value-based antigenic cartography. The three key mutants illustrated in panel C are demonstrated in this bench validation. Markers with white face and black edge indicate the estimated centers for each antigenic cluster

Table 3. Bench serologic results based on ferret serum for validating the predicted antigenically associated residues

Virus                             Ferret serum
                                  Br/07  Perth/09  Vic/11  TX/12  SWZ/13  Utah/13  CR/13  HK/14  Palau/14  FJ/15  Vic/15  Br/15
TX-A138S-F159Y                    <10    40        160     160    160     80       <10    40     160       160    <10     <10
TX-N145S-N225D                    <10    640       640     640    160     640      160    320    320       640    80      160
TX-F159Y-Q311H                    40     160       320     320    80      320      80     640    80        640    160     320
TX-N145S-N225D-A138S              <10    320       640     640    160     320      160    160    160       640    80      160
TX-N145S-N225D-A138S-F159Y        <10    <10       160     160    160     160      <10    80     160       160    <10     80
TX-N145S-N225D-F159Y-Q311H        <10    320       640     640    320     640      80     640    640       1280   160     640
TX-N144S-N145S-N225D-Q311H-F159Y  10     320       640     640    320     640      160    1280   160       2560   320     1280
TX WT                             <10    640       10      1280   160     320      320    160    160       640    640     160
TX-N128A                          <10    1280      10      1280   320     640      320    640    320       1280   1280    320
TX-A138S                          <10    640       10      640    80      320      160    160    160       640    640     80
TX-R142G                          <10    1280      10      1280   320     640      320    320    160       640    1280    160
TX-N145S                          <10    640       10      1280   320     640      160    320    320       640    640     320
TX-F159S                          <10    10        10      320    160     10       10     80     160       320    160     40
TX-N225D                          <10    1280      10      2560   320     640      320    320    320       640    1280    320
TX-N144S-N145S                    <10    640       10      1280   320     640      320    640    320       1280   1280    320
Br/07                             1280   <10       <10     <10    <10     <10      <10    <10    <10       <10    <10     <10
CR/13                             <10    160       160     80     20      40       160    320    <10       320    80      80
FJ/15                             <10    80        160     160    <10     <10      80     640    <10       640    20      160
HK/14                             <10    160       320     160    <10     <10      160    1280   <10       1280   40      320
Palau/14                          <10    <10       <10     160    320     320      80     80     1280      320    80      80
Perth/09                          <10    640       <10     160    <10     80       80     80     <10       320    80      40
SWZ/13                            <10    160       320     320    640     640      40     320    640       1280   <10     160
TX/12                             <10    640       <10     1280   160     320      160    160    80        640    640     640
Utah/13                           <10    320       <10     640    160     1280     80     320    160       640    1280    160
Vic/11                            <10    320       640     640    80      160      80     160    80        320    <10     80
Vic/15                            20     160       40      80     <10     <10      160    1280   <10       640    640     320

Note: Viruses propagated in Madin–Darby canine kidney cells. The results confirmed the antigenic difference among viruses in TX12, SWZ13 and HK14 clusters and the intermediate virus, TX-N145S-N225D (shown in bold).

3.3 Large-scale sequence-based prediction infers antigenic profile of H3N2 seasonal influenza viruses
The quantitative function using these features described in Table 1 was developed and then applied to quantify antigenic distances among 39 370 H3N2 viruses recovered from influenza virus-infected humans during 1968–2016 (Fig. 4). Antigenic cartography was constructed, and by using a spectral clustering algorithm (which does not require a predetermined cluster number) 16 antigenic clusters (HK68, EN72, VI75, TX77, SI87, BE89, BE92, WU95, SY97, FU02, CA04, BR07, PE09, TX12, SWZ13 and HK14) were identified with an average Silhouette index of 0.7486; the Silhouette index is a value ranging from −1 to 1, with higher values indicating better clustering performance. A total of 15 antigenic drift events were identified as leading to 16 antigenic variants; the most recent drift from TX12 during the 2013–2014 influenza season led to two co-circulating antigenic variants, SWZ13-like and HK14-like viruses (Fig. 3).

Prediction performance of the GG-MTSL system procedure could also be validated by comparing the correlations between antigenic maps generated from sequence data (prediction) and serologic data (real data). Antigenic cartography shown in Figure 4 and Supplementary Figure S4 were generated from HA sequences and serologic data (HI), respectively. In sequence-based prediction cartography, all serologically tested antigenic clusters, showing a clear evaluation pattern of those clusters, could be observed, and the pattern matched well with the patterns for serologically tested sequences from each major antigenic cluster.

3.4 Large-scale, sequence-based prediction infers re-emerging H3N2v antigenic variant
Based on key mutations identified from GG-MTSL, a sequence-based prediction could not only predict/infer the antigenic distances and the relationships among all historical H3N2 human viruses, but also could identify re-emergence events in history. Specifically, H3N2v-like viruses were predicted to be antigenically similar to the BE92 cluster, and such results had been confirmed in previous studies (Sun et al., 2013). In addition, antigenic cartography identified an H3N2v-like variant that was antigenically similar to viruses in clusters BE92-SY97 (Fig. 4). The H3N2v-like variant was identified in the summer of 2011 at agricultural fairs and caused 2055 infections among humans in the United States during August 2011–April 2012 (Biggerstaff et al., 2013). This H3N2v virus was possibly transmitted from humans to swine in the mid-1990s and then re-emerged in humans in 2011 (Feng et al., 2013).

Fig. 4. Genetic and antigenic drift of seasonal influenza A(H3N2) viruses (1968–2016). (A) Phylogenetic tree of HA genes for H3N2 viruses showed a continuous natural selection leading to a major trunk structure. (B) Antigenic map of 39 370 viruses demonstrating zig-zag 'S' shape for antigenic relationship among H3N2 viruses. A total of 16 antigenic clusters were identified by using a spectral clustering program. (C) Detection of one potential antigenic variant H3N2v-like viruses. H3N2v-like viruses, which emerged in 2011, were antigenically similar to WU95

4 Discussion
This study presents a robust genomic sequence-based method for quantifying antigenic distances. This method enables the rapid characterization of antigenic profiles and identification of antigenic variants for influenza viruses in real time and on a large scale. In addition, since sequences can be generated directly by using clinical samples, this method can help minimize biases due to culture-adapted mutation during virus isolation (Stevens et al., 2010). This method also allows for the inclusion of uncultivable virus samples into the analyses. Furthermore, multi-task learning allows for the independent characterization of serologic datasets from multiple sources, which are usually difficult to integrate due to various factors, such as types and batches of biologic materials (e.g. reference antiserum and erythrocytes) and supplies (e.g. plates) and variations in the protocol implementation by personnel (Yuan et al., 2013). Another advantage to using multi-task learning is that it makes it possible to use all available data and could help avoid local optimization (referred to as 'overfitting') and false positive results.

In the past years, a few attempts at influenza antigenic variant prediction based on HI data were reported. For example, Lee and Chen (2004) developed a simple correlation method between HA titer and the number of mutations between test viral HA and reference viral HA. Liao et al. (2008) applied multiple regression and logistic regression between mutations and HI values; Huang et al. (2009) developed a decision tree algorithm in drift variant prediction by deriving association rules from HI data based on information theory. In our previous work, we developed a sparse learning method to identify antigenicity-associated residues by using serologic data and formulated this sparse learning problem as an optimization problem that measures the correlation between the antigenic distance changes in serologic data and antigenic profiling by using a scoring function that characterizes the magnitude of mutations in protein sequences (Cai et al., 2012; Han et al., 2016; Sun et al., 2013; Yang et al., 2014). Instead of predicting antigenic distances among viruses, Neher et al. developed a sparse learning method to predict the HI titers for pairs of antigen and sera (Neher et al., 2016). Nevertheless, these methods treat the data analyses or learning as a single task and require data integration; therefore, these methods face the associated challenges previously described. Thus, the GG-MTSL method presented in this study is unique from other available methods, and our findings show that GG-MTSL performed superiorly over a single task-sparse learning method, indicating the effectiveness of the multi-task strategy (Supplementary Table S1).

By using the antigenic characterization results of 39 370 H3N2 viruses recovered from patients during 1968–2016, we showed that the GG-MTSL system proposed in this study identified 16 antigenic clusters of subtype H3N2 influenza virus (Fig. 4 and Supplementary Fig. S4) and showed the dynamics of antigenic evolution of these viruses. The results of our large-scale and sequence-based antigenic cartography suggest that antigenic evolution of H3N2 viruses is much less punctuated than it used to be (Shih et al., 2007), as confirmed by antigenic maps derived from serologic assays (Russell et al., 2008). The continuity of antigenic variations presents great challenges for identifying and defining a virus as an antigenic variant. Although the large set of serologic and genetic data reflects the complete picture of viral evolution, it also complicates identification of antigenic variants during the vaccine strain selection process.

Of interest, this study suggested that two variants (genetic clade 3C.2a and clade 3C.3a; antigenic clusters HK14 and SWZ13) emerged in 2013 and then co-circulated during the subsequent three influenza seasons, with one variant predominating in some regions and the other predominating in other regions. These two genetic variants are antigenically distinct (Fig. 3), and the extent to which an SWZ13-like vaccine would be effective against HK14-like viruses, and vice-versa, is not known. It is also not known how long these two antigenic variants will continue to co-circulate among humans. Co-circulation of multiple antigenic variants presents great challenges in vaccine strain selection (Ampofo et al., 2012).

In this study, we detected one re-emerging H3N2 variant; this finding creates another challenge in influenza surveillance by adding another layer of complexity in antigenic variant detection. For example, among the swine population in North America, the current predominant influenza A(H3N2) virus is associated with a spillover of human seasonal H3N2 viruses to pigs in the 1990s (Zhou et al., 1999). In the past 2 decades, genomic analyses suggested at least 22 introductions of influenza A viruses from humans to swine, eight of which were human seasonal subtype H3N2 viruses (Shu et al., 2012). Uncertainty surrounding the emergence of such variants at the human-swine interface increases the need for surveillance coverage beyond urban areas with dense human populations.

Acknowledgements
This project was supported by the National Institutes of Health [grant number R01AI116744]. We thank Dr. Scott Hensley for providing the HA plasmids of influenza TX12 virus and Dr. Hang Xie for providing the HA plasmids of influenza SWZ13 and HK14 viruses.

Conflict of Interest: none declared.

References
Ampofo,W.K. et al. (2012) Improving influenza vaccine virus selection: report of a WHO informal consultation held at WHO headquarters, Geneva, Switzerland, 14–16 June 2010. Influenza Other Respir. Viruses, 6, 142–152.
Bao,Y. et al. (2008) The influenza virus resource at the national center for biotechnology information. J. Virol., 82, 596–601.
Barnett,J.L. et al. (2012) Antigenmap 3d: an online antigenic cartography resource. Bioinformatics, 28, 1292–1293.
Beck,A. and Teboulle,M. (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2, 183–202.
Biggerstaff,M. et al. (2013) Estimates of the number of human infections with influenza A (H3N2) variant virus, United States, August 2011–April 2012. Clin. Infect. Dis., 57, S12–S15.
Cai,Z. et al. (2012) Identifying antigenicity-associated sites in highly pathogenic H5N1 influenza virus hemagglutinin by using sparse learning. J. Mol. Biol., 422, 145–155.
Cai,Z. et al. (2010) A computational framework for influenza antigenic cartography. PLoS Comput. Biol., 6, e1000949.
Chen,X. et al. (2012) Smoothing proximal gradient method for general structured sparse regression. Ann. Appl. Stat., 6, 719–752.
Feng,Z. et al. (2013) Antigenic characterization of H3N2 influenza A viruses from Ohio agricultural fairs. J. Virol., 87, 7655–7667.
Han,L. and Zhang,Y. (2015) Learning multi-level task groups in multi-task learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), pp. 2638–2644.
Han,L. and Zhang,Y. (2016) Multi-stage multi-task learning with reduced rank. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), pp. 1638–1644.
Han,L. et al. (2016) Generalized hierarchical sparse model for arbitrary-order interactive antigenic sites identification in flu virus data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 865–874.
Harper,S.A. et al. (1984) Prevention and control of influenza.
Harvey,W.T. et al. (2016) Identification of low- and high-impact hemagglutinin amino acid substitutions that drive antigenic drift of influenza A (H1N1) viruses. PLoS Pathog., 12, e1005526.
Huang,J.-W. et al. (2009) Co-evolution positions and rules for antigenic variants of human influenza A/H3N2 viruses. BMC Bioinformatics, 10, S41.
Jaggi,M. et al. (2010) A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 471–478.
Lee,M.-S. and Chen,J.S.-E. (2004) Predicting antigenic variants of influenza A/H3N2 viruses. Emerg. Infect. Dis., 10, 1385.
Liao,Y.-C. et al. (2008) Bioinformatics models for predicting antigenic variants of influenza A/H3N2 virus. Bioinformatics, 24, 505–512.
Lin,Y.P. et al. (2010) Neuraminidase receptor binding variants of human influenza A (H3N2) viruses resulting from substitution of aspartic acid 151 in the catalytic site: a role in virus attachment? J. Virol., 84, 6769–6781.
Liu,J. et al. (2009) Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 339–348.
Mansfield,K. (2007) Viral tropism and the pathogenesis of influenza in the mammalian host. Am. J. Pathol., 171, 1089–1092.
Neher,R.A. et al. (2016) Prediction, dynamics, and visualization of antigenic phenotypes of seasonal influenza viruses. Proc. Natl. Acad. Sci. USA, 113, E1701–E1709.
Ng,A.Y. et al. (2002) On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst., 2, 849–856.
Ren,X. et al. (2015) Computational identification of antigenicity-associated sites in the hemagglutinin protein of A/H1N1 seasonal influenza virus. PLoS One, 10, e0126742.
Russell,C.A. et al. (2008) The global circulation of seasonal influenza A (H3N2) viruses. Science, 320, 340–346.
Shih,A.C.-C. et al. (2007) Simultaneous amino acid substitutions at antigenic sites drive influenza A hemagglutinin evolution. Proc. Natl. Acad. Sci. USA, 104, 6283–6288.
Shu,B. et al. (2012) Genetic analysis and antigenic characterization of swine origin influenza viruses isolated from humans in the united states, 1990–2010. Virology, 422, 151–160.
Shu,Y. and McCauley,J. (2017) Gisaid: global initiative on sharing all influenza data–from vision to reality. Eurosurveillance, 22, pii: 30494.
Smith,D.J. et al. (1999) Variable efficacy of repeated annual influenza vaccination. Proc. Natl. Acad. Sci. USA, 96, 14001–14006.
Smith,D.J. et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science, 305, 371–376.
Smith,R.F. and Smith,T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Protein Eng. Des. Sel., 5, 35–41.
Squires,R.B. et al. (2012) Influenza research database: an integrated bioinformatics resource for influenza research and surveillance. Influenza Other Respir. Viruses, 6, 404–416.
Stevens,J. et al. (2010) Receptor specificity of influenza A H3N2 viruses isolated in mammalian cells and embryonated chicken eggs. J. Virol., 84, 8287–8299.
Sun,H. et al. (2013) Using sequence data to infer the antigenicity of influenza virus. MBio, 4, e00230-13–e00213.
Thompson,M. et al. (2010) Estimates of deaths associated with seasonal influenza-United States, 1976–2007. Morb. Mortal. Wkly Rep., 59, 1057–1062.
Thompson,W.W. et al. (2004) Influenza-associated hospitalizations in the United States. JAMA, 292, 1333–1340.
Yang,J. et al. (2014) Sequence-based antigenic change prediction by a sparse learning method incorporating co-evolutionary information. PLoS One, 9, e106660.
Yuan,X.-T. et al. (2013) A joint matrix completion and filtering model for influenza serological data integration. PLoS One, 8, e69842.
Zhou,N.N. et al. (1999) Genetic reassortment of avian, swine, and human influenza A viruses in american pigs. J. Virol., 73, 8851–8856.
Bioinformatics – Oxford University Press
Published: Jun 7, 2018