Abstract During the demographic history of the Pan clade, there has been gene-flow between species, likely >200,000 years ago. Bonobo haplotypes in three subspecies of chimpanzee have been identified to be segregating in modern-day chimpanzee populations, suggesting that these haplotypes, with increased differentiation, may be a target of natural selection. Here, we investigate signatures of adaptive introgression within the bonobo-like haplotypes in chimpanzees using site frequency spectrum-based tests. We find evidence for subspecies-specific adaptations in introgressed regions involved with male reproduction in central chimpanzees, the immune system in eastern chimpanzees, female reproduction and the nervous system in Nigeria-Cameroon chimpanzees. Furthermore, our results indicate signatures of balancing selection in some of the putatively introgressed regions. This might be the product of long-term balancing selection resulting in a similar genomic signature as introgression, or possibly balancing selection acting on alleles reintroduced through gene flow. chimpanzee, introgression, natural selection, balancing selection Introduction Introgression events, or the admixture between two closely related species has been described previously in plants and animals (Arnold and Martin 2009), between humans and Neanderthals (Smith et al. 2005; Green et al. 2010), Denisovans (Reich et al. 2011), and an unknown hominin (Mondal et al. 2016), and most recently between chimpanzees and bonobos (de Manuel et al. 2016). After an introgression event occurs, two scenarios, outside of neutrality, are possible. Either the new haplotype is not beneficial to the new species or the new variants may allow for some adaptive advantage. The first case, which is more common, tends to be removed through purifying selection or drift. This was described both for introgressed segments in the human genome and for introgressed segments in the chimpanzee genome (Sankararaman et al. 2014; de Manuel et al. 2016; Juric et al. 2016; Kuhlwilm et al. 2016). The second, rare case, describes adaptive introgression, where variants introduced through introgression will increase in frequency through adaptive selection. Exploring the role that introgression plays in the evolution of a species has been limited (Arnold and Martin 2009; Hedrick 2013). However, some examples of adaptive advantage exist; for instance, the increased resistance to herbivores in the sunflower (Whitney et al. 2006), or warfarin resistance in mice (Song et al. 2011). Within humans, introgression from Denisovans allowed for adaptation to high altitudes in Tibetans (Huerta-Sanchez et al. 2014) and Neanderthal admixture benefits to the immune response (Dannemann et al. 2016; Deschamps et al. 2016). Here, we present the first study of selection specifically in introgressed regions of the chimpanzee genome, using the results from de Manuel et al. (2016). After a selection scan in the introgressed regions, an enrichment test is used in order to investigate possible phenotypes selected, identifying candidate genes that may have an adaptive advantage. Conversely, we also discuss the possibility that a few of the introgressed regions may indicate the presence of long-term balancing selection. Overall, we discuss the possibility that introgression allows for natural selection to act on a highly differentiated fraction of the genome. Materials and Methods Full genome sequences from 18 P.t. troglodytes, 19 P.t. schweinfurthii, 10 P.t. ellioti, and 10 bonobos were obtained from a previous study (Prado-Martinez et al. 2013); as well as the coordinates of introgressed regions (de Manuel et al. 2016). We calculated Tajima’s D (Tajima 1989), Fu and Li’s D and F (Fu and Li 1993) for the whole genome of each population. All statistics were calculated in 30-kb windows with a 3-kb sliding window. Windows were masked if less than five SNPs were present, following Pybus et al. (2014, 2015). The windows were deemed as introgressed if any part was within an introgressed region. Full methods available in the Supplementary Material online. Results Following de Manuel et al. (2016), 54.7 Mb of the chimpanzee genome are detected with a signature of bonobo introgression in any non-P.t. verus individual. These segments are significantly depleted of genic content (P < 2.2 × 10−16, Wilcoxon rank test), that is, they contain fewer protein-coding bases than found in random regions, similar to patterns of gene flow in the human lineage (Sankararaman et al. 2014). However, some protein coding substitutions are present (supplementary table S1, supplementary information 1.1, Supplementary Material online). Selection Scans of the Introgressed Regions As background selection will remove the majority of introgressed haplotypes, in order to test the strength of background selection, we compared the B scores (McVicker et al. 2009) of the introgressed segments as compared with random regions across the genome. In introgressed fragments, these scores are significantly higher (P < 2.2 × 10−16, Wilcoxon rank test), showing weaker background selection. This suggests that bonobo alleles were more often tolerated in neutrally evolving regions of the genome, analogous to Neanderthal haplotypes in modern humans (Sankararaman et al. 2016; Vernot et al. 2016). To detect adaptive evolution, three selection tests (Tajima’s D, Tajima 1989 and Fu and Li’s D and F, Fu and Li 1993) based on site frequency spectrum (SFS) analysis were applied. When comparing the genome-wide distribution of the SFS-based statistics with the introgressed portion, in all cases, the three selection tests are shifted toward positive values within the introgressed regions (fig. 1). A deviation in the positive direction indicates a higher proportion of intermediate frequency variants, as seen in the genome-wide distribution of the SFS (supplementary fig. S1, Supplementary Material online). This shift toward positive values cannot be explained by sampling (supplementary fig. S2, supplementary information 1.2, Supplementary Material online). The shift in SFS values may be due either to balancing selection or what is expected under introgression, and may be difficult to discern between the two without additional information (see below). Fig. 1. View largeDownload slide —Statistical distribution of Tajima’s D, Fu and Li’s D and F. Black represents the whole genome and yellow represents the introgressed regions. (A) Pan troglodytes troglodytes. (B) Pan troglodytes schweinfurthii. (C) Pan troglodytes ellioti. Fig. 1. View largeDownload slide —Statistical distribution of Tajima’s D, Fu and Li’s D and F. Black represents the whole genome and yellow represents the introgressed regions. (A) Pan troglodytes troglodytes. (B) Pan troglodytes schweinfurthii. (C) Pan troglodytes ellioti. In order to detect the genes (and the possible underlying functions) that have been under selection, we use an outlier approach in the introgressed regions of each of the subspecies, selecting the 25 minimum (regions likely under a selective sweep) and maximum (regions under adaptive introgression or balancing selection) regions for each statistic (supplementary tables S2–S7, Supplementary Material online). It is important to note here, that when compared with the whole genome, the extreme introgressed regions fall into the top 0.7% for the positive tail and the bottom 0.4% for the negative tail. Positively Selected Introgression—Negative Tail and Tissue Enrichment For each of the subspecies, we have a list of genes in or closest to the introgressed regions (supplementary tables S2–S4, Supplementary Material online) that have extreme negative values for the three selection tests and thus are very likely to have evolved under positive selection. We first performed an enrichment analysis with DEPICT (Pers et al. 2015) in order to relate the set of genes to specific tissues and likely, to a phenotype. We then explore the genes in or closest to each region through a literature search to identify genes with a known association either by function, by expression (among top three median expression when comparing across all tissues as obtained by GTEx; GTEx Consortium 2017), or likely inclusion in functional pathways as obtained by the PathCards database (Belinky et al. 2015). For P.t. troglodytes (supplementary table S2, Supplementary Material online), 25 extreme negative regions for each statistic were selected, and due to overlaps between statistics, 38 total regions were considered with 35 unique genes. We find two (medical subject heading) MeSH terms (table 1) from the tissue enrichment analysis that are significant (P < 0.05). These tissues are testis and the retina. For the first group (testis), there are a total of 13 genes of interest. For eye-related sense organs there are a total of two regions with confirmed function in eye-related traits (fig. 2A and supplementary table S2, supplementary information 2.1, Supplementary Material online). Thus, among the regions that were introgressed from bonobo into P.t. troglodytes, there is an enrichment of genes whose phenotypes are related to maleness, with interesting genes that are highly expressed in testis and may have had a special male reproductive function. Table 1 DEPICT Algorithm Results for Tissue Expression of Introgressed Regions in Each Subspecies MeSH First Level MeSH Second Level MeSH Number P Value Pan troglodytes troglodytes Testis Endocrine system A06.407.312.782 0.028 Retina Sense organs A09.371.729 0.04 Pan troglodytes schweinfurthii Synovial fluid Musculoskeletal system A02.835.583.443.800.800 7.487×10−5 Monocytes Hemic and immune systems A15.378.316.580 0.007 Myeloid cells Cells A11.627 0.028 Mononuclear phagocyte system Hemic and immune systems A15.382.812 0.031 Umbilical veins Cardiovascular system A07.231.908.670.874 0.032 Portal system Cardiovascular system A07.231.908.670 0.032 Bone marrow cells Hemic and immune systems A15.378.316 0.036 Hematopoietic system Hemic and immune systems A15.378 0.036 Endothelial cells Cells A11.436.275 0.037 Phagocytes Hemic and immune systems A15.382.680 0.042 Blood Hemic and immune systems A15.145 0.043 Veins Cardiovascular system A07.231.908 0.047 Pan troglodytes ellioti Oocytes Cells A11.497.497.600 0.042 Ovum Urogenital System A05.360.490.690 0.042 Neural stem cells Cells A11.872.653 0.046 Retina Sense Organs A09.371.729 0.046 MeSH First Level MeSH Second Level MeSH Number P Value Pan troglodytes troglodytes Testis Endocrine system A06.407.312.782 0.028 Retina Sense organs A09.371.729 0.04 Pan troglodytes schweinfurthii Synovial fluid Musculoskeletal system A02.835.583.443.800.800 7.487×10−5 Monocytes Hemic and immune systems A15.378.316.580 0.007 Myeloid cells Cells A11.627 0.028 Mononuclear phagocyte system Hemic and immune systems A15.382.812 0.031 Umbilical veins Cardiovascular system A07.231.908.670.874 0.032 Portal system Cardiovascular system A07.231.908.670 0.032 Bone marrow cells Hemic and immune systems A15.378.316 0.036 Hematopoietic system Hemic and immune systems A15.378 0.036 Endothelial cells Cells A11.436.275 0.037 Phagocytes Hemic and immune systems A15.382.680 0.042 Blood Hemic and immune systems A15.145 0.043 Veins Cardiovascular system A07.231.908 0.047 Pan troglodytes ellioti Oocytes Cells A11.497.497.600 0.042 Ovum Urogenital System A05.360.490.690 0.042 Neural stem cells Cells A11.872.653 0.046 Retina Sense Organs A09.371.729 0.046 Table 1 DEPICT Algorithm Results for Tissue Expression of Introgressed Regions in Each Subspecies MeSH First Level MeSH Second Level MeSH Number P Value Pan troglodytes troglodytes Testis Endocrine system A06.407.312.782 0.028 Retina Sense organs A09.371.729 0.04 Pan troglodytes schweinfurthii Synovial fluid Musculoskeletal system A02.835.583.443.800.800 7.487×10−5 Monocytes Hemic and immune systems A15.378.316.580 0.007 Myeloid cells Cells A11.627 0.028 Mononuclear phagocyte system Hemic and immune systems A15.382.812 0.031 Umbilical veins Cardiovascular system A07.231.908.670.874 0.032 Portal system Cardiovascular system A07.231.908.670 0.032 Bone marrow cells Hemic and immune systems A15.378.316 0.036 Hematopoietic system Hemic and immune systems A15.378 0.036 Endothelial cells Cells A11.436.275 0.037 Phagocytes Hemic and immune systems A15.382.680 0.042 Blood Hemic and immune systems A15.145 0.043 Veins Cardiovascular system A07.231.908 0.047 Pan troglodytes ellioti Oocytes Cells A11.497.497.600 0.042 Ovum Urogenital System A05.360.490.690 0.042 Neural stem cells Cells A11.872.653 0.046 Retina Sense Organs A09.371.729 0.046 MeSH First Level MeSH Second Level MeSH Number P Value Pan troglodytes troglodytes Testis Endocrine system A06.407.312.782 0.028 Retina Sense organs A09.371.729 0.04 Pan troglodytes schweinfurthii Synovial fluid Musculoskeletal system A02.835.583.443.800.800 7.487×10−5 Monocytes Hemic and immune systems A15.378.316.580 0.007 Myeloid cells Cells A11.627 0.028 Mononuclear phagocyte system Hemic and immune systems A15.382.812 0.031 Umbilical veins Cardiovascular system A07.231.908.670.874 0.032 Portal system Cardiovascular system A07.231.908.670 0.032 Bone marrow cells Hemic and immune systems A15.378.316 0.036 Hematopoietic system Hemic and immune systems A15.378 0.036 Endothelial cells Cells A11.436.275 0.037 Phagocytes Hemic and immune systems A15.382.680 0.042 Blood Hemic and immune systems A15.145 0.043 Veins Cardiovascular system A07.231.908 0.047 Pan troglodytes ellioti Oocytes Cells A11.497.497.600 0.042 Ovum Urogenital System A05.360.490.690 0.042 Neural stem cells Cells A11.872.653 0.046 Retina Sense Organs A09.371.729 0.046 Fig. 2. View largeDownload slide —Proportion of genes in or nearest to each introgressed region with DEPICT tissue enrichment analysis through a literature search. (A) Pan troglodytes troglodytes. (B) Pan troglodytes schweinfurthii. (C) Pan troglodytes ellioti. Fig. 2. View largeDownload slide —Proportion of genes in or nearest to each introgressed region with DEPICT tissue enrichment analysis through a literature search. (A) Pan troglodytes troglodytes. (B) Pan troglodytes schweinfurthii. (C) Pan troglodytes ellioti. In the case of P.t. schweinfurthii (supplementary table S3, Supplementary Material online), the 25 most extreme regions for each statistic are combined in 38 regions and the DEPICT algorithm of enrichment gives 12 MeSH terms (table 1) from the tissue enrichment analysis that are significant. These 12 tissues are involved with the Synovial Compartment, Immune System, Cardiovascular System, and Blood. We find, for phenotypes related to blood and immune function as well as the synovial compartment, 16 interesting genes among the 36 considered (fig. 2B and supplementary table S3, supplementary information 2.2, Supplementary Material online). The overall result of the genes under positive selection in the introgressed regions points to immunity, specifically of genes expressed in white blood cells. Many different genes have been preserved and driven, by selection, into an immunity response that may have benefited from the past exposure in another species. In the P.t. ellioti subspecies (supplementary table S4, Supplementary Material online), the 25 most extreme regions for each statistic overlap resulting in 30 total regions and 28 genes. When analyzed by DEPICT, four MeSH terms (table 1) are significant in the tissue enrichment analysis. They include Oocyte and Ovum, the Nervous System, and the Retina. For the first two tissue categories, oocyte and ovum, we find eight genes of note. For the second phenotypic category, neural stem cells, 15 genes have some relationship to brain phenotypes in the literature. Lastly, for eye-related sense organs, two genes have been shown to play a role in the development of the retina (fig. 2C and supplementary table S4, supplementary information 2.3, Supplementary Material online). Thus, for P.t. ellioti, we see enrichment in genes related to femaleness and neural functions. Adaptive Introgression and Balancing Selection—Positive Tail The segments identified in the positive tail consist of regions both under long-term balancing selection and adaptive introgression, that is, regions that are considered as introgressed because they are in excess of heterozygotes. Because these two biological processes are distinct it is not logical to apply any enrichment test to these regions as a cohesive group. We discuss here genes in the extreme positive tail with the unique combination that their coding region falls within the introgressed region (supplementary tables S5–S7, column 6, Supplementary Material online) and where the corresponding region in bonobo (supplementary tables S8–S10, supplementary information 3.1–3.3, Supplementary Material online) has statistical values near zero (indicating no significant selection of any kind). These regions can be considered as adaptive introgression because they are segregating at intermediate frequencies and have not been removed by purifying selection nor been subject to long-term balancing selection during the history of the Pan clade. Their evolutionary significance may be either recent balancing selection or a complex situation leading to an excess of heterozygotes. We discuss the specific candidate genes under an adaptive advantage in the supplementary information 3.1–3.3, Supplementary Material online. Overall, regions with signatures of adaptive introgression and segregating at intermediate frequencies in chimpanzees but not in bonobos tend to lie within protein-coding regions, and several of these genes contain protein-coding substitutions (supplementary table S1, Supplementary Material online). This underscores the interesting evolutionary benefit that may arise through introgression between species. Conversely, we classify regions as under long-term balancing selection when they are shared among all subspecies of chimpanzee and bonobo. To find the most interesting candidates, first, we selected the most extreme 25 positive regions for one of the three statistics in each subspecies of chimpanzee (Top 0.4% tail genome-wide; supplementary tables S5–S7, Supplementary Material online) and kept regions only when the corresponding region in the bonobo genome has a value for one of the statistics in the top 1% tail of its genome-wide distribution (supplementary tables S11, Supplementary Material online); indicated in the last column of supplementary tables S5–S7, Supplementary Material online. Out of the 21 unique regions with these conditions, the majority (17/21) are also in the top 5% genome-wide tail for at least another subspecies (supplementary fig. S3 and table S11, last two columns, Supplementary Material online) and some are in the top fraction for all: STEAP1, UNC5D, RIOK2, ZWINT, PCDH9, COL11A1, and SLC16A7. Based on this combination, it is likely that these regions have been undergoing long-term balancing selection during the evolutionary history of the Pan clade. Due to this type of selective force, these regions have an increased genetic diversity as well as unusually old lineages (Charlesworth 2006). Interestingly, some regions under long-term balancing selection in the Pan clade (supplementary information 4, supplementary tables S5–S7, Supplementary Material online) are falling outside of coding regions, implicating the possibility of the importance of balancing selection in regulatory regions. Discussion We have established that the introgressed regions of bonobo genome into chimpanzee are depleted of purifying selection indicating that the haplotypes that have survived do not confer a negative fitness within the chimpanzee genome. Previous studies (Arnold and Martin 2009) have identified reproductive traits as candidates for adaptive introgression in different species. This mechanism may increase the viability of hybrid offspring and counter reproductive isolation. There have been well-established differences between the reproductive traits of bonobos and chimpanzees, which may make gene flow difficult. For instance, the estrous cycle of the two species is understood to be quite different, with prolonged sexual swellings in bonobo females (Furuichi 1987), discrepancies in ovulation timing (Ryu et al. 2015), and the age of menarche differing by 3 years (Behringer et al. 2014). Two MeSH terms for P.t. ellioti are involved with ovulation, six genes are highly expressed in the female reproductive organs, and two genes have a known function within the female cycle. The most striking enrichment for adaptive introgression in P.t. troglodytes falls within male-related tissues and genes. When comparing chimpanzees and bonobos, some evidence indicates a differentiation between the ratio of body size and testis size and with midpiece volume (the part of a spermatozoon between the head and the tail; Anderson and Dixson 2002). Furthermore, testosterone levels are known to fluctuate in chimpanzee males while remaining consistent in bonobos (Wobber et al. 2012). Although no evidence is available for subspecies-specific phenotypes related to male reproduction, selection within these introgressed regions in P.t. troglodytes may indicate a benefit gained from bonobo introgression. We indeed find 15 genes with implication in testis phenotypes, specifically four genes having functional importance. We also find two protein-coding substitutions in genes involved in fertility, which might be strong candidates for adaptation. This evidence together indicates that selection may have acted to facilitate the admixture of these two differentiated species. In general, the majority of evidence for differences between the two species is in behavior (as reviewed by Gruber and Clay 2016). Behavioral traits are of course extremely complex and difficult to research, however, we do want to highlight that one of the MeSH terms from enrichment in P.t. ellioti is neural stem cells; and that nine of the closest genes are implicated in some brain function, specifically with seven of those functionally involved in the development of the nervous system. Behaviorally bonobos are viewed as “delayed” because they exhibit playful behavior throughout life (Hare et al. 2012) and due to slower development of endocranial volume in juveniles (Durrleman et al. 2012). These two species are well-established to be social animals and the selection of bonobo haplotypes may also have had an impact on allowing their gene flow. Interesting, P.t. schweinfurthii has MeSH terms involved in immune function. This, on the surface, is surprising as it is well-established that P.t. schweinfurthii and bonobos live in adjacent habitats; and would therefore come in contact with extremely similar pathogens. However, evidence suggests that this subspecies specifically is a reservoir for Simian Immunodeficiency Virus (SIV, the Pan equivalent to HIV) while no wild infections have been found in bonobos (Li et al. 2012), indicating a difference in their immune response to viral infection. Intriguingly, four genes closest to introgressed regions of P.t. schweinfurthii were placed in the HIV life cycle according to PathCards (Belinky et al. 2015); with the addition that six closest genes have known function within the immune system, indicating a likely boost to the immune system in this subspecies. This study identifies 13 genes as under adaptive introgression or balancing selection that were previously identified as under selection in chimpanzees; specifically, ADAM22, FAM162B, VPS41, INPP4B, NCAM2, SORCS1, ANO2, CCSER1, ZPLD1, HS6ST3, and GALNTL6 (positive selection) and HLA-DQA1 and AKR1E2 (balancing selection; Cagan et al. 2016). On the whole, the regions which contain introgression tend to fall into the extreme tails genome-wide (well within the extreme 1%). It is therefore likely for these specific genes to be classified as under selection. Finally, long-term balancing selection can be rare and difficult to identify (Charlesworth 2006) and here we have found regions that may have interesting features. There is evidence for balancing selection occurring outside coding regions; specifically in a promoter (Wilson et al. 2006), in the 5′ regulatory region ∼2 kb outside the coding region (Bamshad 2002), and upstream ∼4 kb outside the coding region (Sun et al. 2011). However, no study has satisfactorily attributed this evolutionary force this far outside of genes and for several regions, while here, we identify 15 regions in the extreme tails which are between 30 and 600 kb to the closest gene. These data, however, are uniquely rich and show similar patterns of balancing selection between two species (and, in most cases, across the three subspecies), making a strong case for the importance of long-term long-distance balancing selection. Supplementary Material Supplementary data are available at Genome Biology and Evolution online. Acknowledgments This study has been possible thanks to grant BFU2016-77961-P (AEI/FEDER, UE) awarded by the Agencia Estatal de Investigación (Spain) and with the support of Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2014 SGR 866). J.N. was supported by an FI PhD fellowship by Agaur (FI-DGR 2015). M.K. was supported by a DFG fellowship (KU 3467/1-1). T.M.-B. is supported by MINECO BFU2014-55090-P (FEDER), U01 MH106874 grant, Howard Hughes International Early Career and Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya. Literature Cited Anderson MJ, Dixson AF. 2002. Sperm competition-Motility and the midpiece in primates. Nature, 416(6880):496–496. Arnold ML, Martin NH. 2009. Adaptation by introgression. J Biol . 8( 9): 82. Google Scholar CrossRef Search ADS PubMed Bamshad MJ. 2002. A strong signature of balancing selection in the 5′ cis-regulatory region of CCR5. Proc Natl Acad Sci U S A . 99( 16): 10539– 10544. Google Scholar CrossRef Search ADS PubMed Behringer V, Deschner T, Deimel C, Stevens JMG, Hohmann G. 2014. Age-related changes in urinary testosterone levels suggest differences in puberty onset and divergent life history strategies in bonobos and chimpanzees. Horm Behav . 66( 3): 525– 533. Google Scholar CrossRef Search ADS PubMed Belinky Fet al. , 2015. PathCards: multi-source consolidation of human biological pathways. Database 2015( 0): bav006. Google Scholar CrossRef Search ADS PubMed Cagan Aet al. , 2016. Natural selection in the great apes. Mol Biol Evol . 33( 12): 3268– 3283. Google Scholar CrossRef Search ADS PubMed Charlesworth D. 2006. Balancing selection and its effects on sequences in nearby genome regions. PLoS Genet . 2( 4): e64– 384. Google Scholar CrossRef Search ADS PubMed Dannemann M, Andres AM, Kelso J. 2016. Introgression of Neandertal- and Denisovan-like haplotypes contributes to adaptive variation in human Toll-like receptors. Am J Hum Genet . 98( 2): 399. Google Scholar CrossRef Search ADS de Manuel Met al. , 2016. Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science 354( 6311): 477– 481. Google Scholar CrossRef Search ADS PubMed Deschamps Met al. , 2016. Genomic signatures of selective pressures and introgression from Archaic hominins at human innate immunity genes. Am J Hum Genet . 98( 1): 5– 21. Google Scholar CrossRef Search ADS PubMed Durrleman S, Pennec X, Trouvé A, Ayache N, Braga J. 2012. Comparison of the endocranial ontogenies between chimpanzees and bonobos via temporal regression and spatiotemporal registration. J Hum Evol . 62( 1): 74– 88. Google Scholar CrossRef Search ADS PubMed Fu YX, Li WH. 1993. Statistical tests of neutrality of mutations. Genetics 133( 3): 693– 709. Google Scholar PubMed Furuichi T. 1987. Sexual swelling, receptivity, and grouping of wild pygmy chimpanzee females at Wamba, Zaire. Primates 28( 3): 309– 318. Google Scholar CrossRef Search ADS Green REet al. , 2010. A draft sequence of the Neandertal genome. Science 328( 5979): 710– 722. Google Scholar CrossRef Search ADS PubMed Gruber T, Clay Z. 2016. A comparison between bonobos and chimpanzees: a review and update. Evol Anthropol . 25( 5): 239– 252. Google Scholar CrossRef Search ADS PubMed GTEx Consortium. 2017. Genetic effects on gene expression across human tissues. Nature 550( 7675): 204– 213. CrossRef Search ADS PubMed Hare B, Wobber V, Wrangham R. 2012. The self-domestication hypothesis: evolution of bonobo psychology is due to selection against aggression. Anim Behav . 83( 3): 573– 585. Google Scholar CrossRef Search ADS Hedrick PW. 2013. Adaptive introgression in animals: examples and comparison to new mutation and standing variation as sources of adaptive variation. Mol Ecol . 22( 18): 4606– 4618. Google Scholar CrossRef Search ADS PubMed Huerta-Sanchez Eet al. , 2014. Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature 512( 7513): 194– 197. Google Scholar CrossRef Search ADS PubMed Juric I, Aeschbacher S, Coop G. 2016. The strength of selection against Neanderthal introgression. PLoS Genet . 12( 11): e1006340. Google Scholar CrossRef Search ADS PubMed Kuhlwilm Met al. , 2016. Ancient gene flow from early modern humans into Eastern Neanderthals. Nature 530( 7591): 429– 433. Google Scholar CrossRef Search ADS PubMed Li Yet al. , 2012. Eastern chimpanzees, but not Bonobos, represent a Simian immunodeficiency virus reservoir. J Virol . 86( 19): 10776– 10791. Google Scholar CrossRef Search ADS PubMed McVicker G, Gordon D, Davis C, Green P. 2009. Widespread genomic signatures of natural selection in Hominid evolution. PLoS Genet . 5( 5): e1000471. Google Scholar CrossRef Search ADS PubMed Mondal Met al. , 2016. Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation. Nat Genet . 48( 9): 1066– 1070. Google Scholar CrossRef Search ADS PubMed Pers THet al. , 2015. Biological interpretation of genome-wide association studies using predicted gene functions. Nat Commun . 6: 5890. Google Scholar CrossRef Search ADS PubMed Prado-Martinez Jet al. , 2013. Great ape genetic diversity and population history. Nature 499( 7459): 471– 475. Google Scholar CrossRef Search ADS PubMed Pybus M, et al. 2014. 1000 Genomes Selection Browswer 1.0: a genome browser dedicated to signatures of natural selection in modern humans. Nucleic Acids Res. 42, D903–D909. Pybus M, et al. 2015. Hierarchical boosting: a machine-learning framework to detect and classify hard selective sweeps in human populations. Bioinformatics, 31(24):3946–3952. Reich Det al. , 2011. Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania. Am J Hum Genet . 89( 4): 516– 528. Google Scholar CrossRef Search ADS PubMed Ryu H, Hill DA, Furuichi T. 2015. Prolonged maximal sexual swelling in wild bonobos facilitates affiliative interactions between females. Behaviour 152( 3–4): 285– 311. Google Scholar CrossRef Search ADS Sankararaman Set al. , 2014. The genomic landscape of Neanderthal ancestry in present-day humans. Nature 507( 7492): 354– 357. Google Scholar CrossRef Search ADS PubMed Sankararaman S, Mallick S, Patterson N, Reich D. 2016. The combined landscape of Denisovan and Neanderthal ancestry in present-day humans. Curr Biol . 26( 9): 1241– 1247. Google Scholar CrossRef Search ADS PubMed Smith FH, Jankovic I, Karavanic I. 2005. The assimilation model, modern human origins in Europe, and the extinction of Neandertals. Quant Int . 137( 1): 7– 19. Song Yet al. , 2011. Adaptive introgression of anticoagulant rodent poison resistance by hybridization between old world mice. Curr Biol . 21( 15): 1296– 1301. Google Scholar CrossRef Search ADS PubMed Sun Cet al. , 2011. A signature of balancing selection in the region upstream to the human UGT2B4 gene and implications for breast cancer risk. Hum Genet . 130( 6): 767– 775. Google Scholar CrossRef Search ADS PubMed Tajima F. 1989. Statistical-method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123( 3): 585– 595. Google Scholar PubMed Vernot Bet al. , 2016. Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science 352( 6282): 235– 239. Google Scholar CrossRef Search ADS PubMed Whitney KD, Randell RA, Rieseberg LH. 2006. Adaptive introgression of herbivore resistance traits in the weedy sunflower Helianthus annuus. Am Nat . 167( 6): 794– 807. Google Scholar CrossRef Search ADS PubMed Wilson JNet al. , 2006. A hallmark of balancing selection is present at the promoter region of interleukin 10. Genes Immun . 7( 8): 680– 683. Google Scholar CrossRef Search ADS PubMed Wobber V, Lipson S, Hare B, Wrangham R, Ellison P. 2012. Species differences in the ontogeny of testosterone production between chimpanzees and bonobos. Am J Phys Anthropol . 147: 305– 306. © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Abstract Background Flaviviruses, which include Dengue (DV) and West Nile (WN), mutate in response to immune system pressure. Identifying escape mutants, variant progeny that replicate in the presence of neutralizing antibodies, is a common way to identify functionally important residues of viral proteins. However, the mutations typically occur at variable positions on the viral surface that are not essential for viral replication. Methods are needed to determine the true targets of the neutralizing antibodies. Results Stereophysicochemical variability plots (SVPs), 3-D images of protein structures colored according to variability, as determined by our PCPMer program, were used to visualize residues conserved in their p hysical c hemical p roperties (PCPs) near escape mutant positions. The analysis showed 1) that escape mutations in the flavivirus envelope protein are variable residues by our criteria and 2) two escape mutants found at the same position in many flaviviruses sit above clusters of conserved residues from different regions of the linear sequence. Conservation patterns in T-cell epitopes in the NS3- protease suggest a similar mechanism of immune system evasion. Conclusion The SVPs add another dimension to structurally defining the binding sites of neutralizing antibodies. They provide a useful aid for determining antigenically important regions and designing vaccines.
Infection models with animals whose immune systems have been selectively altered by neutralization of endogenous cytokines or by deletion of a gene have provided a valuable means to study the function of cells or cytokines in the context of complex multidimensional interactions. In particular, knockout mice have allowed a deeper insight into thein vivo performance of antifungal innate and acquired immunity, whose interplay is considered fundamental in the general defense against infections. It is conceivable that such an integrated view of effector and regulatory immune mechanisms operating in opportunistic fungal infections would facilitate the search for cells, cytokines and molecular pathways that are essential to control fungal infectivity or oppose fungus-associated immunopathology.
Inherited genetic susceptibility to multiple myeloma has been investigated in a number of studies. Although 23 individual risk loci have been identified, much of the genetic heritability remains unknown. Here we carried out genome-wide interaction analyses on two European cohorts accounting for 3,999 cases and 7,266 controls and characterized genetic susceptibility to multiple myeloma with subsequent meta-analysis that discovered 16 unique interacting loci. These risk loci along with previously known variants explain 17% of the heritability in liability scale. The genes associated with the interacting loci were found to be enriched in transforming growth factor beta signaling and circadian rhythm regulation pathways suggesting immunoglobulin trait modulation, TH17 cell differentiation and bone morphogenesis as mechanistic links between the predisposition markers and intrinsic multiple myeloma biology. Further tissue/cell-type enrichment analysis associated the discovered genes with hemic-immune system tissue types and immune-related cell types indicating overall involvement in immune response.
Non-viral in vivo administration of plasmid DNA for vaccines and immunotherapeutics has been hampered by inefficient delivery. Methods to enhance delivery such as in vivo electroporation (EP) have demonstrated effectiveness in circumventing this difficulty. However, the contact-dependent nature of EP has resulting side effects in animals and humans. Noncontact delivery methods should, in principle, overcome some of these obstacles. This report describes a helium plasma–based delivery system that enhanced humoral and cellular antigen-specific immune responses in mice against an intradermally administered HIV gp120-expressing plasmid vaccine (pJRFLgp120). The most efficient plasma delivery parameters investigated resulted in the generation of geometric mean antibody-binding titers that were 19-fold higher than plasmid delivery alone. Plasma mediated delivery of pJRFLgp120 also resulted in a 17-fold increase in the number of interferon-gamma spot-forming cells, a measure of CD8+ cytotoxic T cells, compared with non-facilitated plasmid delivery. This is the first report demonstrating the ability of this contact-independent delivery method to enhance antigen-specific immune responses against a protein generated by a DNA vaccine.
Background Pediatric patients undergoing invasive operations bear extra risk of developing nosocomial infections (NIs). However, epidemiological evidence of the underlying risk factors, which is needed for early prevention, remains limited. Methods Using data from the electronic medical records and the NI reporting system of a tertiary pediatric hospital, we conducted a retrospective analysis to identify preoperative and operation-related risk factors for postoperative NIs. Multivariable accelerated failure time models were fitted to select independent risk factors. The performance of these factors in risk stratification was examined by comparing the empirical risks between the model-defined low- and high-risk groups. Results A total of 18,314 children undergoing invasive operations were included for analysis. After a follow-up period of 154,700 patient-days, 847 postoperative NIs were diagnosed. The highest postoperative NI rate was observed for operations on hemic and lymphatic system. Surgical site infections were the NI type showing the highest overall risk; however, patients were more likely to develop urinary tract infections in the first postoperative week. Older age, higher weight-for-height z-score, longer preoperative ICU stay, preoperative enteral nutrition, same-day antibiotic prophylaxis, and higher hemoglobin level were associated with delayed occurrence of postoperative NIs, while longer preoperative hospitalization, longer operative duration, and higher American Society of Anesthesiologists score showed acceleration effects. Risk stratification based on these factors in an independent patient population was moderate, resulting in a high-risk group in which 72% of the postoperative NIs were included. Conclusions Our findings suggest that pediatric patients undergoing invasive operations and at high risk of developing postoperative NIs are likely to be identified using basic preoperative and operation-related risk factors, which together might lead to moderately accurate risk stratification but still provide valuable information to guide early and judicious prevention. Introduction Nosocomial infections (NIs) pose a long-standing challenge to clinical practitioners and remain one of the leading causes of in-hospital mortality [1]. The overall prevalence of NIs varies from 7% in affluent countries to 15% in economically developing countries [2, 3], whereas the most common types of NIs are invariably surgical site infections (SSIs) and device-associated infections [2, 4, 5], suggesting that most of the NIs are attributable to invasive operations. Pediatric patients undergoing invasive operations face extra risk of developing NIs because of their underdeveloped immune system. According to two European studies, the NI incidence was 2.5% in general pediatric wards and was 17% in surgical wards [6, 7]. From a prevention perspective, by recognizing the preoperative and operation-related factors that enhance patients’ susceptibility to postoperative NIs, healthcare providers would be able to identify vulnerable patients for closer observation and to initiate timely prophylactic treatment when necessary. For adult surgical patients, several epidemiological studies have revealed a wide variety of such factors [8–10], but detailed research in pediatric patients is scarce. In the present study, we focused on pediatric patients undergoing invasive operations and aimed to identify preoperative and operation-related risk factors for the occurrence of postoperative NIs. Methods Study setting and design The present study was a hospital-based retrospective cohort study conducted in a tertiary referral hospital—Guangzhou Women and Children’s Medical Center—in Guangzhou, China. The data source of this study was the electronic medical records (EMRs) of the pediatric inpatients who underwent invasive procedures between 2016 and 2018. Extraction of data and ascertainment of study outcomes Clinical data were derived from the EMRs and then linked to the hospital’s NI reporting system via patient identification numbers. In the EMRs, operative procedures were documented in both text and the procedural codes defined by the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM). According to the criteria of the NI reporting system, an infection was considered nosocomial if it occurred > 48 hours after admission. Neonatal infections were considered nosocomial as well if they were acquired during delivery. Diagnoses of specific NI types were made following the criteria from the Centers for Disease Control and Prevention/National Healthcare Safety Network [11]. The outcome of interest in the present study was NI diagnosed after invasive operation. For patients with multiple invasive operations and/or multiple postoperative NI episodes, only the first invasive operation and the first postoperative NI were analyzed. The patient cohort In the EMR database, we identified 18,314 patients who underwent invasive operations between 2016 and 2018 on one of the following specific systems: the nervous system (ICD-9-CM code: 01–05), respiratory system (ICD-9-CM code: 30–34), cardiovascular system (ICD-9-CM code: 35–39), hemic and lymphatic system (ICD-9-CM code: 40–41), digestive system (ICD-9-CM code: 42–54), urinary system (ICD-9-CM code: 55–59), and musculoskeletal system (ICD-9-CM code: 76–84). The main reason for us to focus on these systems was that the invasive operations performed on these systems except those on the hemic and lymphatic system were the most common ones in our institution. Invasive operations on the hemic and lymphatic system were relatively rare but were included because of the high postoperative NI rate after hemic and lymphatic surgeries. Ethical considerations This study was approved by the ethics committee of the Guangzhou Women and Children’s Medical Center (2019–13600). This study utilized historical data collected during routine clinical practice and the findings of this study will be used only for academic activities; therefore requirement for informed consent from patients was waived. Statistical analysis The following candidate risk factors were considered: sex; age, nutritional status measured with weight-for-age z-score (WAZ), and blood test result at admission; lengths of preoperative hospitalization and intensive care unit (ICU) stay; preoperative enteral nutrition (EN) and parenteral nutrition (PN) support; operative duration; surgical implantation; antibiotic prophylaxis; the American Society of Anesthesiologists (ASA) score; surgical wound classification (SWC); and ICD-9-CM code. For patients younger than 10 years of age, WAZ was calculated using the WHO growth standards as the reference. For patients above that age, WAZ is not an appropriate measure of nutritional status and thus was treated as missing. Antibiotic prophylaxis was defined as use of antibiotics on the same day as the operation was performed. In order to retain the patients with missing values in analysis, binary indicators were created to denote data incompleteness, which was the case for WAZ, operative duration, and ASA score. Binary indicators were also created for preoperative ICU, blood test at admission, and surgical incision to distinguish between patients who did and who did not receive these procedures (see Table A in S1 File for a detailed description). Time to event was defined as the duration from date of operation to date of NI diagnosis or date of discharge, whichever came first. After confirming that the log-transformed time to event was approximately normally distributed, accelerated failure time (AFT) models with log-normal distribution were fitted. In log-normal AFT models, the exponentiated coefficient (eβ) of a factor is interpreted as time ratio (TR) indicating whether this factor would decelerate (eβ >1) or accelerate (eβ <1) the occurrence of the event. The TR estimates can be converted into the number of days by which the time between operation and occurrence of postoperative NIs could be prolonged or shortened using the following formula: ΔT = Tref {exp[βx(x-xref)] −1}, where Tref denotes the time between the operation and the occurrence of postoperative NI for a reference patient, and Xref denotes the risk factor profile of the reference patient. A reduced model was achieved using backward elimination with a threshold P-value of 0.1. Indicator variable and the variables related to it were handled as a block during variable selection; therefore we chose a less stringent P-value to decrease the possibility of excluding blocks that contain statistically significant risk factors. In order to control for the clinical heterogeneity of the operations, we predetermined that ICD-9-CM code should be included in the models regardless of the result of variable selection. We also built multivariable Cox regression models to estimate the hazard ratios (HRs), which are more commonly used to describe the strengths of risk associations. In order to examine the clinical applicability of the identified risk factors for risk stratification purpose, we derived a postoperative NI risk score for another data set of 4,383 pediatric patients who underwent invasive operations between January 2019 and May 2019, of them 110 NIs were diagnosed postoperatively. The postoperative NI risk score was calculated by summing up the risk factors weighted by the AFT model coefficients and used the median of this risk score as a cutoff to divide the patients into low- and high-risk groups (Risk score calculation in S1 File). We compared the two groups regarding their empirical postoperative NI risks, which were estimated using the Nelson-Aalen method [12]. All the statistical tests were two-sided with P < 0.05 considered statistically significant. All the statistical analyses were performed using the “survival” package in R (R Foundation for Statistical Computing, Vienna, Austria). Study setting and design The present study was a hospital-based retrospective cohort study conducted in a tertiary referral hospital—Guangzhou Women and Children’s Medical Center—in Guangzhou, China. The data source of this study was the electronic medical records (EMRs) of the pediatric inpatients who underwent invasive procedures between 2016 and 2018. Extraction of data and ascertainment of study outcomes Clinical data were derived from the EMRs and then linked to the hospital’s NI reporting system via patient identification numbers. In the EMRs, operative procedures were documented in both text and the procedural codes defined by the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM). According to the criteria of the NI reporting system, an infection was considered nosocomial if it occurred > 48 hours after admission. Neonatal infections were considered nosocomial as well if they were acquired during delivery. Diagnoses of specific NI types were made following the criteria from the Centers for Disease Control and Prevention/National Healthcare Safety Network [11]. The outcome of interest in the present study was NI diagnosed after invasive operation. For patients with multiple invasive operations and/or multiple postoperative NI episodes, only the first invasive operation and the first postoperative NI were analyzed. The patient cohort In the EMR database, we identified 18,314 patients who underwent invasive operations between 2016 and 2018 on one of the following specific systems: the nervous system (ICD-9-CM code: 01–05), respiratory system (ICD-9-CM code: 30–34), cardiovascular system (ICD-9-CM code: 35–39), hemic and lymphatic system (ICD-9-CM code: 40–41), digestive system (ICD-9-CM code: 42–54), urinary system (ICD-9-CM code: 55–59), and musculoskeletal system (ICD-9-CM code: 76–84). The main reason for us to focus on these systems was that the invasive operations performed on these systems except those on the hemic and lymphatic system were the most common ones in our institution. Invasive operations on the hemic and lymphatic system were relatively rare but were included because of the high postoperative NI rate after hemic and lymphatic surgeries. Ethical considerations This study was approved by the ethics committee of the Guangzhou Women and Children’s Medical Center (2019–13600). This study utilized historical data collected during routine clinical practice and the findings of this study will be used only for academic activities; therefore requirement for informed consent from patients was waived. Statistical analysis The following candidate risk factors were considered: sex; age, nutritional status measured with weight-for-age z-score (WAZ), and blood test result at admission; lengths of preoperative hospitalization and intensive care unit (ICU) stay; preoperative enteral nutrition (EN) and parenteral nutrition (PN) support; operative duration; surgical implantation; antibiotic prophylaxis; the American Society of Anesthesiologists (ASA) score; surgical wound classification (SWC); and ICD-9-CM code. For patients younger than 10 years of age, WAZ was calculated using the WHO growth standards as the reference. For patients above that age, WAZ is not an appropriate measure of nutritional status and thus was treated as missing. Antibiotic prophylaxis was defined as use of antibiotics on the same day as the operation was performed. In order to retain the patients with missing values in analysis, binary indicators were created to denote data incompleteness, which was the case for WAZ, operative duration, and ASA score. Binary indicators were also created for preoperative ICU, blood test at admission, and surgical incision to distinguish between patients who did and who did not receive these procedures (see Table A in S1 File for a detailed description). Time to event was defined as the duration from date of operation to date of NI diagnosis or date of discharge, whichever came first. After confirming that the log-transformed time to event was approximately normally distributed, accelerated failure time (AFT) models with log-normal distribution were fitted. In log-normal AFT models, the exponentiated coefficient (eβ) of a factor is interpreted as time ratio (TR) indicating whether this factor would decelerate (eβ >1) or accelerate (eβ <1) the occurrence of the event. The TR estimates can be converted into the number of days by which the time between operation and occurrence of postoperative NIs could be prolonged or shortened using the following formula: ΔT = Tref {exp[βx(x-xref)] −1}, where Tref denotes the time between the operation and the occurrence of postoperative NI for a reference patient, and Xref denotes the risk factor profile of the reference patient. A reduced model was achieved using backward elimination with a threshold P-value of 0.1. Indicator variable and the variables related to it were handled as a block during variable selection; therefore we chose a less stringent P-value to decrease the possibility of excluding blocks that contain statistically significant risk factors. In order to control for the clinical heterogeneity of the operations, we predetermined that ICD-9-CM code should be included in the models regardless of the result of variable selection. We also built multivariable Cox regression models to estimate the hazard ratios (HRs), which are more commonly used to describe the strengths of risk associations. In order to examine the clinical applicability of the identified risk factors for risk stratification purpose, we derived a postoperative NI risk score for another data set of 4,383 pediatric patients who underwent invasive operations between January 2019 and May 2019, of them 110 NIs were diagnosed postoperatively. The postoperative NI risk score was calculated by summing up the risk factors weighted by the AFT model coefficients and used the median of this risk score as a cutoff to divide the patients into low- and high-risk groups (Risk score calculation in S1 File). We compared the two groups regarding their empirical postoperative NI risks, which were estimated using the Nelson-Aalen method [12]. All the statistical tests were two-sided with P < 0.05 considered statistically significant. All the statistical analyses were performed using the “survival” package in R (R Foundation for Statistical Computing, Vienna, Austria). Results Baseline characteristics of the cohort, stratified by the subsequent NI status, are shown in Table 1. Compared with the patients who did not develop postoperative NIs, those who did were relatively younger, had a lower WAZ and longer preoperative hospitalization, and were more likely to have preoperative ICU stay and to receive preoperative nutrition support. The NI group had on average a lower hemoglobin level. Regarding operation-related characteristics, patients in the NI group had an averagely longer operative duration and were more likely to receive surgical implantation and to have higher ASA scores. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 1. Baseline characteristics of a retrospective cohort of pediatric patients who underwent invasive operations (n = 18,314), stratified by postoperative NI status, the Guangzhou Women and Children’s Medical Center, 2016–2018. https://doi.org/10.1371/journal.pone.0225607.t001 During a follow-up of 154,700 patient-days, 847 postoperative NIs were diagnosed, yielding an incidence rate of 5.5 per 1000 PDs. Overall, operations on hemic and lymphatic system were followed by the highest NI rate (11.0 per 1,000 PDs), and SSIs were the most common NI type (Table 2). However, when confined to the first postoperative week, the highest NI risk was observed for cardiovascular operations (Fig 1A), and the NI type with the highest risk was urinary tract infections (UTIs, Fig 1B). Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 1. Risk of developing NIs in the first postoperative week, stratified by operative site (plot A) and by infection site (plot B). BSI (bloodstream infection), GI (gastrointestinal infection), LRI (lower respiratory tract infection), NI (nosocomial infection), SSI (surgical site infection), URI (upper respiratory tract infection, (UTI) urinary tract infection. https://doi.org/10.1371/journal.pone.0225607.g001 Download: PPT PowerPoint slide PNG larger image TIFF original image Table 2. Incidence of major NIs in a retrospective pediatric patient cohort (n = 18,314) after invasive operations, by operative site (ICD-9-CM) and by NI type, the Guangzhou Women and Children’s Medical Center, 2016–2018. https://doi.org/10.1371/journal.pone.0225607.t002 In the full multivariable AFT model, older age, higher WAZ, longer preoperative ICU stay, preoperative EN, antibiotic prophylaxis, and higher hemoglobin level were statistically significantly associated with delayed occurrence of postoperative NIs (Table 3). Longer preoperative hospitalization, blood test at admission, and longer operative duration were statistically significantly associated with advanced occurrence of the events. Backward elimination led to a reduced model including all the prior statistically significant risk factors as well as high ASA score (≥ III). According to the reduced AFT model and given a reference patient who was defined as follows: underwent an invasive operation on the hemic and lymphatic system, age = 12 months, WAZ = −3, days of preoperative hospitalization = 7, preoperative ICU = 0, preoperative EN = 0, preoperative antibiotic prophylaxis = 0, hemoglobin = 100g/L, WBC = 10×109/L, operative duration = 120 minutes, and ASA score = II, the estimated time between invasive operation and the occurrence of postoperative NIs was 56 days. Therefore, antibiotic prophylaxis = 1, preoperative EN = 1, and one year older in age would extend this time by approximately 24, 14, and 4 days, respectively, for otherwise comparable patients; whereas preoperative hospital stay for one more week, one-point increase in ASA score, and one-hour increase in operation duration would advance the occurrence of the postoperative NIs by 11, 10 and 5 days, respectively. Download: PPT PowerPoint slide PNG larger image TIFF original image Table 3. Associations between baseline factors and the development of postoperative NIs in multivariable AFT log-normal models, the Guangzhou Women and Children’s Medical Center, 2016–2018. https://doi.org/10.1371/journal.pone.0225607.t003 The multivariable Cox model (Table B in S1 File) yielded largely consistent results except for the statistically significant positive association for WBC. In the reduced Cox model, the proportional hazards assumption was not met for age at operation, antibiotic prophylaxis, and operative duration. Risk stratification based on the reduced AFT model was applied to the patients who underwent invasive operations between January 2019 and May 2019 (n = 4,383). Fig 2 shows the empirical postoperative NI risks in the resulting low- and high-risk groups. Within the first 30 days after operation, the high-risk group (2,191 patients including 79 NIs) showed a consistently higher postoperative NI risk than the low-risk group (2,192 patients including 31 NIs). The 30-day postoperative NI rates were 3.1 and 5.4 per 1,000 PDs for the two groups, respectively, and 72% of the postoperative NIs occurred in the high-risk group. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Postoperative NI risk in the low-risk group (2,192 patients including 21 NIs) and the high-risk group (2,191 patients including 79 NIs), stratified using the median of the risk score derived from the reduced AFT modela. aAnalysis was done in an independent data set of patients undergoing invasive operation between January and May, 2019 (n = 4,383). NI (nosocomial infection). https://doi.org/10.1371/journal.pone.0225607.g002 Discussion In this retrospective cohort of pediatric inpatients undergoing invasive operations, older age, higher WAZ, longer ICU stay, preoperative EN support, antibiotic prophylaxis, and higher hemoglobin level were associated with delayed occurrence of postoperative NIs, while longer preoperative hospitalization, longer operative duration, and higher ASA score might accelerate the occurrence of the postoperative NIs. Risk stratification based on these factors in the same cohort was moderately accurate. In this study, NI rate following hemic and lymphatic operations was the highest, probably due to immunodeficiency commonly seen in patients with hematologic diseases. In two studies including one in pediatric surgical patients, the most frequent NI types were SSI [9, 13], -consistent with our results. In the present study, however, the risk of developing UTIs within the first postoperative week was higher than the risks of developing other types of NIs, suggesting that the pathogens causing UTIs might have a relatively short incubation period. Malnutrition increases the risk of infection by impairing the immune system [14, 15], accounting for the inverse association between WAZ and postoperative NI risk in this study. Increased NI risk in relation to malnutrition has been reported in prior studies, where malnutrition was defined using different measures such as low birth weight and low body mass index [16, 17]. As a biomarker of malnutrition, low hemoglobin level may compromise non-specific immunity and increase susceptibility to infection [18], which explains its inverse association with postoperative NI risk in our cohort. For the reference patient we defined, we found that one-day increase in ICU stay before operation was associated with a delay of approximately 2 days in the occurrence of postoperative NIs: this was likely to result from the benefits of intensive care, such as close observation and nutrition support. Previous studies reported a positive association between length of ICU stay and NI risk among ICU patients [19, 20], which however was not surprising given the fact that the risk of developing NIs in ICU always accrues as the ICU stay extends. Furthermore, those studies were flawed by reverse causality (i.e. occurrence of NIs in ICU extends ICU stay) and their problematic use of logistic regression to analyze time-to-event data. In our cohort, longer preoperative hospitalization was associated with an increased postoperative NI risk: one-week longer preoperative hospital stay would accelerate the occurrence of postoperative NIs by 11 days for the reference patient, in line with a previous study in adult surgical patients [21]. Length of preoperative hospitalization is largely a proxy of the severity and complexity of the underlying disease. Moreover, prolonged preoperative hospitalization means extended exposure to pathogens and enhanced possibility of developing hospital-acquired malnutrition [22, 23]: both increase patients’ susceptibility to infections. Current evidence regarding the association between EN and NI risk is inconclusive [24, 25]. Thus the inverse association between preoperative EN and postoperative NI risk in our cohort warrants confirmation by future studies. EN may cause NIs via contaminated feedings and/or tracheal colonization of gastric organisms; however, it may also reduce infectious complications by maintaining gut structure [26]. It has been well proved that PN increases the risk of BSI because of CVC involvement [27, 28]. In the present study, only a small fraction of PN (<2%) was administered via CVC, which might partly explain the low incidence of BSIs in the present study as well as the null association between PN and the overall NI risk. Reduced SSI rates due to antibiotic prophylaxis has been reported [29], supporting our finding that antibiotic prophylaxis might delay the occurrence of postoperative NIs by more than 20 days. However, increasing evidence also suggests that antibiotic prophylaxis will cause unnecessary side effects if it is given without indication [30]. In our cohort, antibiotic prophylaxis was common, but lack of detailed data did not allow us to define antibiotic prophylaxis following the established protocols [31, 32] or to determine its necessity and appropriateness in terms of type, timing, and dosage. The present study supports the finding of a prior study that higher ASA scores might raise the risks of specific NI types including SSI, UTI, and nosocomial pneumonia [9]. Our results also confirm the existing evidence that prolonged operative duration might increase the risks of postoperative SSIs and other complications [33]. As for SWC, the present study only had a low number of patients with contaminated wounds and had none with dirty wounds. It has also been suggested that surgical wounds might be substantially misclassified in clinical practice [34], making it even more difficult to compare our result with those from previous studies, where a positive association between SWC and the overall postoperative NI risk might exist given the increased SSI risk among patients with contaminated or dirty wounds [35, 36]. Epidemiologic studies on postoperative NI incidence and its risk factors are inadequate, especially for pediatric patients; therefore, our findings would enrich the existing knowledge and would inform healthcare providers of a patient’s risk of developing postoperative NIs even before invasive operations. As suggested by our results, a risk score based on basic preoperative and operation-related risk factors might lead to moderately accurate risk stratification. However, using such a risk score would be rewarded with early identification of vulnerable patients, timely prevention, and cautious selection of the postoperative treatment or procedures that might further increase the NI risk. The strengths of the present study include its relatively large sample size, a variety of possible risk factors to be considered, and proper statistical analysis. From a methodological perspective, despite the largely consistent results, the AFT model was superior to the Cox model in our case, given that the proportional hazards hypothesis could be legitimately violated: for example, the effect of antibiotic prophylaxis will diminish rather than remaining constant over time. Several limitations of the present study should be noted. First, like many other hospital-based studies, this study was subject to selection bias in particular the referral bias, which might limit the generalizability of our results. Second, we were concerned that using ICD-9-CM code at its lowest level of specificity might not be sufficient to control for the invasiveness of the operation and the severity of the underlying disease. Third, several putative risk factors for NIs were either precluded from our analysis or could not be studied thoroughly due to lack of data, such as use of antibiotics and immunosuppressants. Finally, we could not include the NIs that occurred after discharge as no systematic post-hospital follow-up was performed to collect the data. Conclusions Our findings suggest that pediatric patients undergoing invasive operations and at high risk of developing postoperative NIs are likely to be identified using basic preoperative and operation-related risk factors, which together might lead to moderately accurate risk stratification but still provide valuable information to guide early and judicious prevention. Supporting information S1 File. https://doi.org/10.1371/journal.pone.0225607.s001 (DOC)
Abstract Sera from patients with connective tissue diseases were tested for antibody against the Epstein-Barr virus (EBV) by indirect immunofluorescence and immunodiffusion. There were no significant differences between patient groups and controls in percent seropositive or titer by immunofluorescence. With immunodiffusion against EBV antigens, sera from patients with systemic lupus erythematosus (SLE) had a greater mean number of bands than controls and a miscellaneous group, though not the other patient groups. Many bands apparently unrelated to EBV were detected. A new pattern of nuclear fluorescence was seen which, when present, appeared in 75% to 85% of cells. This consisted of crossing strands of fluorescence confined to the nucleus, in contrast to the homogenous cellular fluorescence seen with EBV-specific reactions. Recognizing antibody activity not specific to EBV infection will be useful to workers employing these tests in future seroepidemiologic studies. References 1. Henle W: Evidence for viruses in acute leukemia and Burkitt's tumor. Cancer 21:580-586, 1968.Crossref 2. Rauscher FJ: Virologic studies in human leukemia and lymphoma: the herpes-type virus. Cancer Res 28:1311-1318, 1968. 3. Niederman JC, McCollum RW, Henle G, et al: Infectious mononucleosis: Clinical manifestations in relation to EB virus antibodies. JAMA 203:205-209, 1968.Crossref 4. Henle G, Henle W, Diehl V: Relation of Burkitt's tumor-associated herpes-type virus to infectious mononucleosis. Proc Nat Acad Sci USA 59:94-101, 1968.Crossref 5. Gerber P, Hamre D, Moy RA, et al: Infectious mononucleosis: Complement-fixing antibodies to herpes-like virus associated with Burkitt's lymphoma. Science 161:173-175, 1968.Crossref 6. Old LJ, Boyse EA, Geering G, et al: Serologic approaches to the study of cancer in animals and in man. Cancer Res 28:1288-1299, 1968. 7. Hirshaut Y, Glade P, Vieira LOBD, et al: Sarcoidosis, another disease associated with serologic evidence for herpes-like virus infection. New Eng J Med 283:502-506, 1970.Crossref 8. Oleinick A: Leukemia or lymphoma occurring subsequent to an autoimmune disease. Blood 29:144-153, 1967. 9. Miller DG: The association of immune disease and malignant lymphoma. Ann Intern Med 66:507-531, 1967.Crossref 10. Finch SC: Infectious mononucleosis , in Samter M (ed): Immunological Diseases. Boston, Little Brown & Co, 1965, pp 394-405. 11. Chessin LN, Glade PR, Kasel JA, et al: The circulating lymphocyte—Its role in infectious mononucleosis. Ann Intern Med 69:333-359, 1968.Crossref 12. Kaplan ME, Tan EM: Antinuclear antibodies in infectious mononucleosis. Lancet 1:561-563, 1968.Crossref 13. Carter RL, Penman HG: Infectious mononucleosis. London, Medical Books Ltd, 1969. 14. Committee of the American Rheumatism Association: Primer on the rheumatic diseases. JAMA 190:741-751,1964.Crossref 15. Dorsch CA, Gibbs CB, Stevens MB, et al: Significance of nuclear immunofluorescent patterns. Ann Rheum Dis 28:313-319, 1969.Crossref 16. Stevens DA, Pry TW, Blackham EA: Prevalence of precipitating antibody to antigens derived from Burkitt lymphoma cultures infected with herpes-type virus (EB virus). Blood 35:263-275, 1970. 17. Henle G, Henle W: Immunofluorescence in cells derived from Burkitt's lymphoma. J Bact 91:1248-1256, 1966. 18. Stevens DA, Pry TW, Manaker RA: Infectious mononucleosis: Always a primary infection with herpes-type virus? J Nat Cancer Inst 44:533-537, 1970. 19. Zur Hausen H, Henle W, Hummeler K, et al: Comparative study of cultured Burkitt's tumor cells by immunofluorescence, autoradiography and electron microscopy. J Virol 1:830-837, 1967. 20. Epstein MA, Achong BG: Specific immunofluorescence test for the herpes-type EB virus of Burkitt lymphoblasts, authenticated by electron microscopy. J Nat Cancer Inst 40:593-607, 1968. 21. Hinuma Y, Konn M, Yamaguchi J, et al: Replication of herpes-type virus in a Burkitt lymphoma cell line. J Virol 1:1203-1206, 1967. 22. Hinuma Y, Konn M, Yamaguchi J, et al: Immunofluorescence and herpes-type virus particles in the P3HR-1 Burkitt lymphoma cell line. J Virol 1:1045-1051, 1967. 23. Stevens DA, Pry TW, Blackham EA, et al: Immunodiffusion studies of EB virus (Herpes-type virus) infected and uninfected hemic cell lines. Int J Cancer 5:229-237, 1970.Crossref 24. McDuffie FC: Twenty years of the lupus erythematosus cell. Ann Intern Med 70:413-417, 1969.Crossref 25. Burnham TK, Neblett TR, Fine G: The immunofluorescent tumor imprint technique: III. The diagnostic and prognostic significance of the "speckle"—inducing antinuclear antibody. Amer J Clin Path 50:683-688, 1968. 26. Casals SP, Friou GJ, Teague PO: Specific nuclear reaction pattern of antibody to DNA in lupus erythematosus sera. J Lab Clin Med 62:625-631, 1963. 27. Tan EM: Relationship of nuclear staining patterns with precipitating antibodies in systemic lupus erythematosus. J Lab Clin Med 70:800-812, 1967. 28. Svec KH: Immunologic and clinical observations of granulocyte-specific antinuclear antibodies. Arthritis Rheum 12:165-172, 1969.Crossref 29. Stevens DA, Pry TW, Blackham EA, et al: Comparison of antigens from human and chimpanzee herpestype virus-infected hemic cell lines. Proc Soc Exp Biol Med 133:678-683, 1970.Crossref 30. Stevens DA, Kottaridis SD, Luginbuhl RE: Investigation of antigenic relationship of Marek's disease herpes virus and EB virus (herpes-type virus). J Comp Path 81:137-140, 1971.Crossref 31. Schur PH: A.N.A. New England J Med 282:1205-1206, 1970.Crossref 32. Dalldorf G, Carvalho RPS, Jamra M, et al: The lymphomas of Brazilian children. JAMA 208:1365-1368, 1969.Crossref 33. Evans AS, Rothfield N, Niederman JC: Raised antibody titers to EB virus in systemic lupus erythematosus. Lancet 1:167-168, 1971.Crossref 34. Holman HR: Systemic lupus erythematosus: Disease of unusual immunologic responsiveness? Amer J Med 27:525-528, 1959.Crossref 35. Phillips P, Christian CL: Myxovirus antibody increases in human connective tissue disease. Science 168:982-984, 1970.Crossref
Abstract • A retrospective study of 112 patients with abdominal aortic aneurysm (AAA) and 232 with arteriosclerosis obliterans (ASO) demonstrated that 16.7% of those with AAA and 1.6% of those with ASO had a history of gastroduodenal ulcer; 83% of these lesions with AAA were gastric ulcers. Ulcer bleeding after vascular reconstruction developed in seven patients with AAA and one with ASO. Serum fibrinogen levels and platelet counts were significantly lower in patients with AAA than in those with ASO or controls. A prospective study showed that 25 (52.1%) of 48 patients with AAA and six (20.0%) of 30 patients with ASO had endoscopically proved gastroduodenal lesions before vascular reconstruction. None of them developed postoperative bleeding after treatment of both their gastroduodenal lesions and coagulopathy. Furthermore, the significant decrease in blood flow and prostaglandin content of gastric mucosa was demonstrated in patients with AAA. References 1. Jones AW, Kirk RS, Bloor K. The association between aneurysm of the abdominal aorta and peptic ulceration . Gut . 1970;11:679-684.Crossref 2. Bouhoutsos J, Barabas A, Martin P. The association of peptic ulcer and abdominal aortic aneurysm and its significance . Br J Surg . 1973;60:302-304.Crossref 3. Mulcare RJ, Royster TS, Weiss HJ, Phillips LL. Disseminated intravascular coagulation as a complication of abdominal aortic aneurysm repair . Ann Surg . 1974;180:343-349.Crossref 4. Diskin CJ, Weitberg AB. Minidose heparin therapy: treatment of chronic intravascular coagulation syndrome . Arch Intern Med . 1980;140:263-266.Crossref 5. Satatini B, Savrin R, Evance WE. Consumption coagulopathy associated with arterial aneurysms . J Cardiovasc Surg . 1979;20:273-278. 6. Elkeles A. Gastric ulcer in the aged and calcified atherosclerosis . AJR Am J Roentgenol . 1964;91:744-750. 7. Wesolowski SA, Fries CC, Sabini AM, Sawyer PN. The significance of turbulence in hemic system and in the distribution of the atherosclerotic lesion . Surgery . 1965;57:155-161. 8. Darke SG, Glasgow MS, Eadie DGA. Studies on maximal acid output in patients with abdominal aortic aneurysms . J Cardiovasc Surg . 1977;18:471-474. 9. Ritchie WP Jr. Acute gastric damage induced by bile salts, acid, and ischemia . Gastroenterology . 1975;68:699-707. 10. Euler AU, Popiela T, Tytgat GN, et al. A multiclinic trial evaluating arbaprostil (15R-15-methyl prostaglandin E2) as a therapeutic agent for gastric ulcer . Gastroenterology . 1989;96:967-971. 11. Major JS, Scholes P. The localization of a histamine H-receptor adenylate cyclase system in canine parietal cells and its inhibition by prostaglandin . Agents Actions . 1983;8:324-331.Crossref 12. Soll AH. Specific inhibition by prostaglandin E2 and I2 of histamine-stimulated 14C-aminopyrine accumulation and cyclic adenosine monophosphate generation by isolation canine parietal cells . J Clin Invest . 1980;65:1222-1229.Crossref 13. Seidler U, Beinborn M, Sewing K-F. Inhibition of acid formation in rabbit parietal cells by prostaglandin is mediated by the prostaglandin E2 receptor . Gastroenterology . 1989;96:314-320. 14. Redfern JS, Blair AJ, Lee E, Feldman M. Gastrointestinal ulcer formation in rabbits immunized with prostaglandin E2 . Gastroenterology . 1987;93:744-752. 15. Fisher DF, Yawn DH, Crawford ES. Preoperative disseminated intravascular coagulation associated with aortic aneurysms . Arch Surg . 1983;118:1252-1255.Crossref 16. Kaneko H, Sakaguchi S. Aneurysm repair and consumption coagulopathy: the importance of fibrinogen level . Jpn J Cardiovasc Surg . 1986;16:58-59.Crossref
Canine Transmissible Venereal Tumor The first transmissible cancer to be identified was canine transmissible venereal tumor (CTVT), a solid tumor that spreads within populations of dogs through sexual contact. This was identified as a transmissible disease and was first experimentally transplanted from dog to dog in 1876 [1]. The etiology of the disease was initially uncertain, but many successful transplant experiments, common karyotypic rearrangements [2], and the finding of a unique integration of a LINE1 retrotransposon in front of the c-myc gene in all cases of CTVT [3] suggested that transmission occurs through transfer and replication of the cancer cells themselves, rather than through viral modification of cells in each new host. Analysis of dog leukocyte antigen alleles, polymorphic microsatellite loci, and mitochondrial DNA confirmed that this cancer is spread as a clonal transmissible cancer lineage [4,5]. Another unexpected feature of CTVT is that it has repeatedly acquired new mitochondrial genomes from its hosts throughout its evolution [6,7]. This suggests that asexual replication of cancer cells may lead to a Muller’s ratchet [8] process in mitochondria, where accumulation of mutations make cancers unsustainable in the long term without reintroduction of competent mitochondrial genomes. CTVT spreads as a sexually transmitted infection in feral dog populations throughout the world, infecting animals in at least 90 countries on all continents except Antarctica [9]. An analysis of the CTVT genome concluded that the cells have been spreading as a transmissible cancer lineage for 10,000–12,000 years and likely arose from an early dog most closely related to Alaskan malamutes [10]. While experimentally transplanted CTVT usually regresses after a few months, naturally acquired disease does not always regress [2]. Thus, these cells have been evolving and spreading as an asexual parasitic organism that has outlived its original canine host by more than 500 generations. Tasmanian Devil Facial Tumor Disease The first observation of a Tasmanian devil suffering from devil facial tumor disease (DFTD) was in 1996 [11]. DFTD is a solid facial tumor that spreads from animal to animal through physical contact when devils bite each other. It was identified as a transmissible cancer when a unique karyotypic rearrangement was found in tumors of multiple animals [12] and was confirmed after sequencing of the genomes of two isolates of DFTD [13]. The fatal disease continues to spread through the devil population and threatens them with extinction, although sequestered insurance populations and ongoing research on possible treatments may help maintain the species [14]. In a recent report, a second, apparently completely independent, lineage of DFTD (termed DFT2) was identified in a small number of animals, suggesting that if the conditions allow, transmissible cancers may arise multiple times [15]. Interestingly, the first lineage of DFTD identified (now termed DFT1) arose from a female devil, but DFT2 arose from a male. The first five cases of DFT2 were found in males, so it is possible that females have some ability to recognize the male DFT2 cells as nonself, but the numbers of cases currently reported are too low to confirm this. Bivalve Transmissible Neoplasias Fatal leukemia-like neoplasias, called disseminated neoplasia or hemic neoplasia, have been reported in at least 15 different bivalve species [16,17]. They can occur at stable enzootic levels, but several epizootic events have been reported, including an outbreak of disease in soft-shell clams (Mya arenaria) on Prince Edward Island, Canada, where prevalences of >90% were recorded, and massive population loss was observed [17,18]. Analysis of the neoplastic cells in soft-shell clams revealed a dramatic amplification in the copy number of a retrotransposon [19], with identical integration sites in neoplastic cells from multiple animals. These data, along with analysis of microsatellites and mitochondrial DNA SNPs, showed that the etiologic agent of this disease is the neoplastic cell itself, as with CTVT and DFTD [20]. The finding of transmissible cancer in soft-shell clams and some evidence from disseminated neoplasia in mussels [21,22] suggested that the neoplasias of other bivalves could also be transmissible. Recently, disseminated neoplasia was analyzed from three additional bivalve species (the mussel, Mytilus trossulus; the cockle, Cerastoderma edule; and the golden carpet-shell clam, Polititapes aureus), and independent transmissible cancer cell lineages were observed in each species [23]. In both mussels and cockles, the cancer lineages arose from the corresponding host species, and in cockles, two apparently independent cockle-derived cancer lineages were found to be spreading through the population, as was found in Tasmanian devils. More unexpectedly, the cancer observed in P. aureus was found to be the result of a cross-species transmission, derived from a different, but related, species, Venerupis corrugata. Curiously, V. corrugata itself does not have a high incidence of the disease, despite living in the same area as P. aureus. As multiple lineages of transmissible cancers are spreading through multiple bivalve species, we call these diseases bivalve transmissible neoplasias (BTN). Since cancers from four bivalve species have been analyzed for transmissible cancer and all four were attributable to BTN lineages, it is reasonable to predict that disseminated neoplasia in other bivalves will be found to be due to BTN as well. Transmissible Cancers in Humans Transmission of cancer from person to person is exceedingly rare. Any cancer cells that somehow find their way from a donor to a recipient would normally be recognized as foreign and rejected by a functional immune system. However, human-to-human transmission of cancer has been reported in several special settings [24], including transplant recipients acquiring cancer from tissue donors [25,26] and transfer of cancer from mother to fetus [27]. In a recent, remarkable case of immune suppression leading to transmission of cancer cells, an AIDS patient acquired neoplastic cells that derived from the cells of a dwarf tapeworm [28]. These are extreme cases in which both physical barriers and immune barriers are lowered. Only a few cases of human cancer transmission without immune suppression or partially matched cells have been reported. There has been a case of a surgeon who accidentally introduced a cancer into his own hand during surgery [29], and at least one case of a needlestick accident leading to growth of a small nodule derived from a human adenocarcinoma cell line in an immunocompetent researcher [30]. In both cases, the tumors were excised and did not recur. Immune Responses to Contagious Cancer In conventional cancers, the cancer cells must avoid recognition of neoantigens made by abnormal induction or mutation of host genes, but contagious cancers must also avoid recognition as cells from a foreign individual. The Major Histocompatibility Complex (MHC) genes in jawed vertebrates form a very strong (but not insurmountable) barrier to cancer allografts. Indeed, CTVT is known to down-regulate expression of MHC genes [31] and carries multiple mutations in genes involved in self-antigen presentation and apoptosis [32]. It has been suggested that lack of MHC diversity in devils contributes to the lack of host recognition of DFTD cells [33,34], but despite their low diversity, experimental allogeneic skin transplants in devils were recognized and rejected [35], and it has recently been shown that DFTD also down-regulates expression of MHC genes [36]. As in CTVT and DFTD, immune evasion plays a key role in the two major settings of human transmissible cancer mentioned above. Cancers transmitted during organ transplants can survive because the recipients are immunosuppressed, and cancers transmitted from mother to fetus are partially tolerated because the maternal cancer cells are half-matched to the fetus—and in at least one case, the neoplastic cells from the mother that grew in the infant had lost the MHC allele that was not shared with the infant [37]. Altogether, the data suggest that MHC-based self–nonself rejection is highly important in preventing transmissible cancers. Bivalves are invertebrates and do not have MHC or any other known histocompatibility system, likely decreasing the barriers for evolution of transmissible cancer lineages. However, there is some evidence for species-specific restriction of BTN lineages, as most known lineages were derived from the original host species, and experimental transmission across species has not been successful [38,39]. Only one example of cross-species transmission from a related species has been observed, and in this case the original host species may have developed resistance to the lineage derived from it [23]. Currently, the mechanisms of these restrictions are unknown. It may resemble self–nonself recognition systems in vertebrates or may be an independent mechanism, as with the BHF gene involved in protecting tunicates from fusion with foreign colonies and stem cell parasitism [40]. Cancer As an Infectious Organism Clonal cancer lineages have spread far beyond their original hosts, replicating as asexual parasites that jump between individuals for decades (and even millennia, in the case of CTVT). It is, in fact, difficult to determine whether to describe these cells as “infecting” or “engrafting” their hosts, reflecting the fact that contagious cancers blur the lines between our normal definitions of cancer and infectious disease. They even stretch our understanding of species, as the living lineage of cells that form CTVT has not been a dog in 10,000 years. It is unlikely that cancer lineages can be considered to be new species, but regardless of terminology, their evolution and replication strategies are clearly distinct from those of their hosts. These contagious cancer lineages have developed a new infectious lifestyle separate from that of the organism from which they arose, and their new interactions provide strong selective pressures, which can affect evolution of both host and pathogen. It is possible that the primary cancers that led to each of these transmissible cancer lineages could have been induced by a conventional pathogen, but so far there is no evidence of this. Some transmissible cancers may have been triggered by transposable elements; there is some indication of the involvement of transposons in CTVT [3] and soft-shell clam-transmissible neoplasia [19]. Regardless of their origin, all available evidence (including multiple genetic markers and full cancer genome sequencing of CTVT and DFTD) leads to the conclusion that the cancer cells themselves have become a pathogen, able to spread from individual to individual. Conclusions Transmissible cancers appear to be limited by two main factors: immune recognition and physical barriers to the spread of cells. In a similar, but less extreme way, conventional cancers can also be thought of as infectious cells that have their own evolutionary pressures and strategies [41,42]. The evasion of immune recognition and the ability to metastasize have obvious parallels in the two main challenges for transmissible cancers. We know that cases of cancer transmission do occur in humans in the special settings of organ transplant and maternal-to-fetal transfer. Transmission would not be likely to occur in immunocompetent people, but we would predict that it is conceivable that certain types of cancer could spread from person to person within immunosuppressed HIV-1-positive populations. Currently, we know of transmissible cancers spreading through natural animal populations in six species across vertebrate and invertebrate organisms and in both terrestrial and marine ecosystems, and further study will likely uncover more examples. Despite recognition only recently as bona fide infectious diseases, transmissible cancers may have been developing, spreading, and producing a strong selective pressure on the evolution of organisms since the beginning of multicellularity.
Abstract The recent extensive application of next-generation sequencing has led to the rapid accumulation of multiple types of data for functional DNA elements. With the advent of precision medicine, the fine-mapping of risk loci based on these elements has become of paramount importance. In this study, we obtained the human reference genome (GRCh38) and the main DNA sequence elements, including protein-coding genes, miRNAs, lncRNAs and single nucleotide polymorphism flanking sequences, from different repositories. We then realigned these elements to identify their exact locations on the genome. Overall, 5%–20% of all sequence element locations deviated among databases, on the scale of kilobase-pair to megabase-pair. These deviations even affected the selection of genome-wide association study risk-associated genes. Our results implied that the location information for functional DNA elements may deviate among public databases. Researchers should take care when using cross-database sources and should perform pilot sequence alignments before element location-based studies. functional elements, sequence alignment, location deviation, precision medicine Introduction The rapid development of high-throughput sequencing technologies and the associated extensive applications has driven biomedical research into the post-genomic era [1, 2]. Many large-scale international cooperative projects have surveyed the human genome [3], and the results of these studies have deepened our understanding of human physiology and pathology [4]. For example, the International HapMap Project (HapMap) has facilitated genome-wide association studies (GWASs) and human genome diversity research; these projects have identified a wide range of genes causing complex diseases and traits [5, 6]. In addition, the 1000 Genomes Project, a human genome map based on large samples from multiple populations, has further increased the depth and breadth of genome research [7, 8], while The Cancer Genome Atlas (TCGA) and other large projects are dedicated to genomic research focused on cancer and other major human diseases [9, 10]. These projects have explored genomic variation, modification, and expression in the etiology of disease and have created conditions for increasingly precise diagnosis and drug development [11, 12]. As genomic science has developed, hundreds of databases have been created, each oriented to different types of DNA sequence elements and/or specific research objectives [13]. The total amount of sequence data generated by large-scale genome projects has rapidly increased [14]. At present, DNA sequencing, RNA sequencing and various databases (e.g. GenBank, miRBase, ENCODE and dbSNP) and comprehensive platforms for sequence searching and downloading (e.g. UCSC and Ensembl) [15–18] provide convenient resources and technical support for functional genomics studies [19, 20]. Many recent studies have included several types of genomic functional elements [21], and investigations of interacting elements have become increasingly common. For example, studies have investigated how microRNAs (miRNAs) target mRNAs [22], how transcription factors and DNA interact [23], or how DNA and long non-coding RNAs (lncRNAs) regulate each other [24]. In addition, studies of polymorphisms and their effects continue to be important. Indeed, single nucleotide polymorphisms (SNPs) and single nucleotide variants (SNVs) are considered useful biomarkers of disease risk, as these factors affect the structures of coding and non-coding genes [25]. Precision medicine has promoted additional cross-omic approaches [26, 27]. As most studies of multiple DNA elements depend on location information acquired from sequence databases, accurate functional element locations within a genome are of great significance, especially for biological and medical studies beyond bioinformatics. In pilot studies, we found that mainstream databases (e.g. GenBank, miRBase, ENCODE and dbSNP) differed slightly with respect to functional element locations, although most of these databases share the same reference genome [28–30]. These discrepancies may perhaps be ascribed to differences in the sequence alignment algorithms, or to data submission quality. However, these inconsistencies are inconvenient and may mislead researchers whose studies rely on the precise locations of functional elements. Therefore, it is necessary to systematically compare DNA element locations from different database sources. Material and Methods Human genome reference sequence acquisition and processing The human reference genome sequence GRCh38, submitted by the Genome Reference Consortium (GRC) on 24 December 2013 [31], was downloaded from GenBank [15]. The reference genome was about 3.4 Gb long and included 22 autosomes, a pair of sex chromosomes and a mitochondrial DNA sequence. The mitochondrial DNA sequence was not included in our study. We retained the sequence information for each chromosome and stored these data in 24 independent files. We removed the headers, blanks, spaces and newline characters from each chromosome file and used the edited files to construct reference genome libraries. Functional DNA element sequence acquisition and processing Gene sequence acquisition and processing We obtained data for 58 137 genes from Ensembl, including not only protein-coding genes but also partial pseudogenes and non-coding genes [32]. This gene set contained 19 786 clearly defined protein-coding genes. The remaining genes were remapped and used for the comparison of gene element locations among databases. Each location data item contained an Ensembl gene ID, associated chromosome, original genome location, gene type and complete sequence (including 5′- and 3′-UTRs). FASTA format sequence libraries were created by removing blank lines and newline characters. Precursor and mature miRNA sequence acquisition and processing We downloaded 1881 precursor miRNA sequences and 2588 mature miRNA sequences from miRBase (Release 22) [16]. We retained the miRBase ID, the original location and the sequence information but removed empty lines and breaks. We then transformed the RNA sequences into cDNA sequences and constructed FASTA format libraries. LncRNA sequence acquisition and processing LncRNA sequences were downloaded from ENCODE (Release 27) [17]: 27 908 transcripts (in which U was been replaced by T) encoding 15 778 lncRNA genes. LncRNA symbols, transcript IDs, original locations and sequences were retained and used to construct FASTA format libraries. Figure 1 View largeDownload slide DNA fragment selection and alignment for different functional elements. Figure 1 View largeDownload slide DNA fragment selection and alignment for different functional elements. SNP-flanking sequence acquisition and processing The SNP-flanking sequences of 324 709 505 SNPs were extracted from the NCBI dbSNP database (Build 150) [18]. SNP ID, original position, length of flanking sequence and allele information were retained. Each SNP allele was represented by the associated International Union of Pure and Applied Chemistry degenerate code. We removed whitespaces and blank lines, changed all lowercase letters into uppercase and then constructed FASTA format libraries using the modified data. We also downloaded SNP position information from HapMap and the 1000 Genomes Project to compare deviations across databases. Functional DNA element alignment and mapping Alignment fragment selection To ensure alignment specificity, the alignment fragment length was set to 25 bp, which yielded a random probability of about 8.88 × 10−16. For long sequence elements, such as protein-coding genes and lncRNAs, we used 25 bp from the head and tail of each sequence as the alignment fragments. For short sequences, such as precursor and mature miRNAs, the whole sequence was aligned. For SNPs, we aligned 25-bp fragments of the up- and down-stream sequences flanking each SNP loci. The details of alignment fragment selection are shown in Figure 1. Functional element sequence alignment After reference genome sequence library construction, we used Bowtie2 to align the functional elements [33, 34]. To generate precise alignments, the Bowtie2 parameters used were ‘-f –score -min L, 0, -0.3’. These settings ensured that the fragments were completely mapped on the reference genome. The alignment results indicated the associated chromosome, strand direction (+/−) and the start/end positions. Due to the broad distribution of DNA variation, we sometimes selected different alignment fragments and constructed a second alignment in order to increase location accuracy. Functional element mapping If both ends of a given sequence had consistent Bowtie2 alignments, the genome location of the elements could be calculated directly. The genome position of the element located on the ‘+’ strand was defined as the 5′-25-bp start site to the 3′-25-bp end; the genome position of the element located on the ‘−’ strand was the converse. The SNP position was defined as the larger of the two ends of the associated flanking sequence mapping results minus 1. The miRNA position was identical to the Bowtie2 result, and the mature miRNA was expected to be located within the host pre-miRNA sequence. If a functional element had location records in the associated source database, we compared these with the alignment results. For spliced lncRNA sequences from ENCODE, the coding DNA fragment location was recovered by the start and end sites of different transcripts. Functional element location filtering and analysis Based on the chromosome, strand, location and sequence length of each mapping result, we classified the obtained locations into seven groups: Perfect Match (PM), Location Deviation (LD), Strand Reversal (SR), Multiple Locations (ML), Chromosome Mismatch (CM), Alignment Location (AL) and Mapping Failure (MF). Results were classified as PMs when all the characters were consistent with the information in the original database. The LD elements were those where the chromosome and strand were consistent between our results and the original database, but the element location differed; location discrepancies inevitably affect subsequent location-based analyses. SR elements were mapped onto the opposite strand by our analysis versus the original database. ML elements were recovered by our analysis in more than one location on the genome. CM elements were mapped on to a different genome by our analysis, in comparison to the original dataset. This type of mismatch is important because cis-acting elements target the same chromosome or DNA molecule, while trans-acting elements target different chromosomes or DNA molecules. Thus, discrepancies in chromosomal locations among databases affect subsequent definitions of functional components. AL elements were those mapped to the genome by our analyses, but which were unmapped in the original database; these new genetic locations will facilitate future analyses. MF elements could not be mapped on the reference genome by our analysis. However, single nucleotide peptides in genomic sequences might cause alignment failures at one or both fragment ends. Thus, for all elements classified as MFs, we extracted, re-aligned and re-classified a 25-bp fragment adjacent to the original fragment. All elements that were still not accurately mapped were considered the final MF group. All elements were classified into one, and only one, group. A detailed flowchart of the classification process is shown in Figure 2. Figure 2 View largeDownload slide Functional element mapping and classification process. Figure 2 View largeDownload slide Functional element mapping and classification process. Multiple database location information comparison Analysis of location deviation effects for GWAS Genes associated with disease risk are primarily identified based on linkages with disease-associated SNPs. It is thus vital that the relative locations of SNPs and SNP-linked genes are accurate. We randomly selected GWAS-identified risk loci and their linked genes within 33 disease phenotypes in dbGAP (e.g. breast neoplasm, coronary artery disease, diabetes, hypertension, Parkinson's disease and prostatic neoplasm) [35], setting the risk threshold to |$1.0\times {10}^{-7}$|. We calculated the distances between the GWAS-identified SNPs and linked genes associated with disease and compared these to the distances identified in our analyses. We determined the potential effects of any location deviations on the GWAS results. SNP location deviations in three key SNP repositories We explored the degree of location deviation for a given functional element type among different data resources using SNPs as an example. First, we selected SNPs from dbSNP that were confirmed to be correctly located by our previous sequence alignment, as well as all the CEU (Europe) SNPs in HapMap and the 1000 Genomes Project. Using the SNP-ID, we compared the locations of the SNPs shared between HapMap and dbSNP, between HapMap and the 1000 Genomes Project and between dbSNP and the 1000 Genomes Project. We then analyzed the deviations in SNP location among these three data resources based on the pairwise comparisons. Next, we extracted the identified MF SNPs from HapMap. We also obtained the minor allele frequencies (MAF) of CEU SNPs. We used a hypergeometric cumulative distribution to calculate the overrepresentation of rare SNPs (MAF < 0.01) in each subset as follows [36–38]: $$ P=1\sum \limits_{i=0}^{m-1}\frac{\left(\begin{array}{l}M\\ {}i\end{array}\right)\left(\begin{array}{l}N-M\\ {}n-i\end{array}\right)}{\left(\begin{array}{l}N\\ {}n\end{array}\right)}, $$ where N is the total number of SNPs in the HapMap CEU population; M is total number of MF SNPs; n is the total number of rare SNPs (MAF < 0.01) in the HapMap CEU population; and m is the number of MF SNPs with a MAF less than 0.01. Probability P described the relationship between SNP allele frequency and mapping failure. Results Gene sequence mapping and analysis Among the 58 137 gene sequences from Ensembl, Bowtie2 alignment identified 52 408 bilateral sequences and 5729 unmapped sequences. We compared the sequence length of each bilateral sequence to the reference sequence and identified 1884 sequence length mismatches. We attempted to realign these 1884 sequences, as well as the 5729 unmapped sequences, using adjacent 25-bp fragments. Across all elements, 85.34% locations were PMs, and 9.85% were MFs (Figure 3A). The remaining element locations (4.81%) were corrected by further subdivision and were assigned to the ML, LD, CM or SR class. Figure 3B shows the classification of protein-coding genes after Bowtie2 alignment. These results indicate that, although the locations of most genes in most databases are accurate, some genes require remapping. These inconsistencies may have serious impacts on follow-up analyses. Pre-miRNA and miRNA sequence mapping and analysis We found that 95.22% of the 1881 pre-miRNA sequences obtained from miRBase were PMs, and ~4.78% were corrected to LDs or MLs (Figure 3E). None of the 1881 sequences was classified as SRs, ALs or MFs. Of the 2588 human miRNA sequences in miRBase, 94.51% were PMs, and ~5.49% were corrected to LDs or MLs (Figure 3F). None of the 2588 sequences was classified as SRs, ALs or MFs. Figure 3 View largeDownload slide Functional element location situation distributions. Figure 3 View largeDownload slide Functional element location situation distributions. LncRNA transcript sequence mapping and analysis Of the 27 908 lncRNA sequences in ENCODE, Bowtie2 alignment mapped 24 556 bilateral sequences, while 3352 sequences were MFs. We compared the Bowtie-derived lengths to the sequence lengths in ENCODE and identified 490 sequence length mismatches. We attempted to realign these 490 sequences, as well as the 3352 MFs, using additional 25-bp sequence fragments. Of the 3842 realigned sequences, 85.81% were PMs, 32 (0.11%) were CMs, 41 (0.15%) were SRs, 197 (0.71%) were LDs and 337 (1.21%) were MLs (Figure 3C). SNP position mapping and analysis The Bowtie2 alignment of the sequences flanking the 324 709 505 SNPs obtained from dbSNP recovered 324 613 893 bilateral alignments and 95 612 unmapped sequences. We compared the sequence length of each bilateral result derived from Bowtie2 to its reference sequence and identified 62 285 116 sequence length mismatches. We realigned these 62 285 116 sequences, as well as the 95 612 MFs. Of the realigned sequences, 75.20% were PMs, 19.21% were MFs and ~5.59% were corrected to LDs, MLs, CMs, ALs or SRs (Figure 3D). Location deviations of risk SNPs and susceptible genes in GWAS In the corrected gene and SNP location datasets, all risk-associated SNPs from dbGaP, in all disease phenotypes, were located on the same chromosome as their linked genes. In subsequent comparisons, the location deviation was considered ‘0’ if the SNP was located exactly within the range of its linked gene. For all other SNPs, we calculated the shortest distance between the SNP and either end of the gene. These distances were divided into three categories: less than 100 kb, between 100 kb and 1 Mb and more than 1 Mb (Table 1). We found that 0%–31.67% of the risk-associated SNPs were located more than 100 kb from the original risk-associated gene, which exceeds the extent of linkage disequilibrium blocks. Table 1 Location deviation of GWAS risk SNPs and linkage genes in dbGaP. The first column of the table shows disease phenotypes in dbGaP. Other columns represent the percentage of different location deviation distribution Phenotype 0 (%) x ≤ 100 Kb (%) 100 Kb–1 Mb (%) x > 1 Mb (%) Anatomy category 44.76 23.57 27.62 4.05 Bacterial infections and mycoses 8.91 89.11 1.98 0.00 Behavior and behavior mechanisms 51.04 21.68 23.78 3.50 Behavioral disciplines and activities 58.82 17.65 23.53 0.00 Biological marker 56.24 25.00 15.63 3.13 Body weight and measure 54.17 24.11 18.30 3.42 Cardiovascular disease 54.34 31.76 12.90 1.00 Chemical and drugs category 59.63 25.75 12.78 1.84 Congenital, hereditary, and neonatal diseases and abnormalities 46.30 31.48 22.22 0.00 Diagnostic techniques and procedures 57.74 23.10 16.54 2.62 Digestives system disease 42.34 40.05 15.82 1.79 Disorder of environment origin 52.63 26.32 21.05 0.00 Endocrine system disease 51.13 40.81 7.30 0.76 Eye diseases 57.44 37.44 5.12 0.00 Female urogenital diseases and pregnancy complications 46.66 36.00 16.67 0.67 Hemic and lymphatic disease 56.63 32.53 9.64 1.20 Immune system disease 42.19 50.90 6.33 0.58 Laboratory techniques and procedure 62.16 29.34 7.72 0.78 Male urogenital disease 41.56 38.68 18.93 0.83 Mental disorders 54.18 26.69 17.13 2.00 Musculoskeletal diseases 44.32 36.36 18.18 1.14 Neoplasms 55.45 30.51 12.83 1.21 Nervous system diseases 46.65 43.10 9.62 0.63 Nutritional and metabolic disease 54.10 24.46 19.88 1.57 Otorhinolaryngologic disease 45.00 45.00 10.00 0.00 Parasitic diseases 0.00 100.00 0.00 0.00 Pathological conditions, signs and symptoms 55.17 22.13 20.40 2.30 Physical examination 54.50 22.97 19.22 3.31 Physical phenomena and processes 58.21 12.69 27.61 1.49 Respiratory tract diseases 46.95 45.22 7.83 0.00 Skin and connective tissue diseases 41.06 52.21 6.37 0.36 Stomatognathic diseases 40.00 45.71 14.29 0.00 Virus diseases 80.00 20.00 0.00 0.00 Others 54.27 36.11 8.53 1.09 Phenotype 0 (%) x ≤ 100 Kb (%) 100 Kb–1 Mb (%) x > 1 Mb (%) Anatomy category 44.76 23.57 27.62 4.05 Bacterial infections and mycoses 8.91 89.11 1.98 0.00 Behavior and behavior mechanisms 51.04 21.68 23.78 3.50 Behavioral disciplines and activities 58.82 17.65 23.53 0.00 Biological marker 56.24 25.00 15.63 3.13 Body weight and measure 54.17 24.11 18.30 3.42 Cardiovascular disease 54.34 31.76 12.90 1.00 Chemical and drugs category 59.63 25.75 12.78 1.84 Congenital, hereditary, and neonatal diseases and abnormalities 46.30 31.48 22.22 0.00 Diagnostic techniques and procedures 57.74 23.10 16.54 2.62 Digestives system disease 42.34 40.05 15.82 1.79 Disorder of environment origin 52.63 26.32 21.05 0.00 Endocrine system disease 51.13 40.81 7.30 0.76 Eye diseases 57.44 37.44 5.12 0.00 Female urogenital diseases and pregnancy complications 46.66 36.00 16.67 0.67 Hemic and lymphatic disease 56.63 32.53 9.64 1.20 Immune system disease 42.19 50.90 6.33 0.58 Laboratory techniques and procedure 62.16 29.34 7.72 0.78 Male urogenital disease 41.56 38.68 18.93 0.83 Mental disorders 54.18 26.69 17.13 2.00 Musculoskeletal diseases 44.32 36.36 18.18 1.14 Neoplasms 55.45 30.51 12.83 1.21 Nervous system diseases 46.65 43.10 9.62 0.63 Nutritional and metabolic disease 54.10 24.46 19.88 1.57 Otorhinolaryngologic disease 45.00 45.00 10.00 0.00 Parasitic diseases 0.00 100.00 0.00 0.00 Pathological conditions, signs and symptoms 55.17 22.13 20.40 2.30 Physical examination 54.50 22.97 19.22 3.31 Physical phenomena and processes 58.21 12.69 27.61 1.49 Respiratory tract diseases 46.95 45.22 7.83 0.00 Skin and connective tissue diseases 41.06 52.21 6.37 0.36 Stomatognathic diseases 40.00 45.71 14.29 0.00 Virus diseases 80.00 20.00 0.00 0.00 Others 54.27 36.11 8.53 1.09 View Large Table 1 Location deviation of GWAS risk SNPs and linkage genes in dbGaP. The first column of the table shows disease phenotypes in dbGaP. Other columns represent the percentage of different location deviation distribution Phenotype 0 (%) x ≤ 100 Kb (%) 100 Kb–1 Mb (%) x > 1 Mb (%) Anatomy category 44.76 23.57 27.62 4.05 Bacterial infections and mycoses 8.91 89.11 1.98 0.00 Behavior and behavior mechanisms 51.04 21.68 23.78 3.50 Behavioral disciplines and activities 58.82 17.65 23.53 0.00 Biological marker 56.24 25.00 15.63 3.13 Body weight and measure 54.17 24.11 18.30 3.42 Cardiovascular disease 54.34 31.76 12.90 1.00 Chemical and drugs category 59.63 25.75 12.78 1.84 Congenital, hereditary, and neonatal diseases and abnormalities 46.30 31.48 22.22 0.00 Diagnostic techniques and procedures 57.74 23.10 16.54 2.62 Digestives system disease 42.34 40.05 15.82 1.79 Disorder of environment origin 52.63 26.32 21.05 0.00 Endocrine system disease 51.13 40.81 7.30 0.76 Eye diseases 57.44 37.44 5.12 0.00 Female urogenital diseases and pregnancy complications 46.66 36.00 16.67 0.67 Hemic and lymphatic disease 56.63 32.53 9.64 1.20 Immune system disease 42.19 50.90 6.33 0.58 Laboratory techniques and procedure 62.16 29.34 7.72 0.78 Male urogenital disease 41.56 38.68 18.93 0.83 Mental disorders 54.18 26.69 17.13 2.00 Musculoskeletal diseases 44.32 36.36 18.18 1.14 Neoplasms 55.45 30.51 12.83 1.21 Nervous system diseases 46.65 43.10 9.62 0.63 Nutritional and metabolic disease 54.10 24.46 19.88 1.57 Otorhinolaryngologic disease 45.00 45.00 10.00 0.00 Parasitic diseases 0.00 100.00 0.00 0.00 Pathological conditions, signs and symptoms 55.17 22.13 20.40 2.30 Physical examination 54.50 22.97 19.22 3.31 Physical phenomena and processes 58.21 12.69 27.61 1.49 Respiratory tract diseases 46.95 45.22 7.83 0.00 Skin and connective tissue diseases 41.06 52.21 6.37 0.36 Stomatognathic diseases 40.00 45.71 14.29 0.00 Virus diseases 80.00 20.00 0.00 0.00 Others 54.27 36.11 8.53 1.09 Phenotype 0 (%) x ≤ 100 Kb (%) 100 Kb–1 Mb (%) x > 1 Mb (%) Anatomy category 44.76 23.57 27.62 4.05 Bacterial infections and mycoses 8.91 89.11 1.98 0.00 Behavior and behavior mechanisms 51.04 21.68 23.78 3.50 Behavioral disciplines and activities 58.82 17.65 23.53 0.00 Biological marker 56.24 25.00 15.63 3.13 Body weight and measure 54.17 24.11 18.30 3.42 Cardiovascular disease 54.34 31.76 12.90 1.00 Chemical and drugs category 59.63 25.75 12.78 1.84 Congenital, hereditary, and neonatal diseases and abnormalities 46.30 31.48 22.22 0.00 Diagnostic techniques and procedures 57.74 23.10 16.54 2.62 Digestives system disease 42.34 40.05 15.82 1.79 Disorder of environment origin 52.63 26.32 21.05 0.00 Endocrine system disease 51.13 40.81 7.30 0.76 Eye diseases 57.44 37.44 5.12 0.00 Female urogenital diseases and pregnancy complications 46.66 36.00 16.67 0.67 Hemic and lymphatic disease 56.63 32.53 9.64 1.20 Immune system disease 42.19 50.90 6.33 0.58 Laboratory techniques and procedure 62.16 29.34 7.72 0.78 Male urogenital disease 41.56 38.68 18.93 0.83 Mental disorders 54.18 26.69 17.13 2.00 Musculoskeletal diseases 44.32 36.36 18.18 1.14 Neoplasms 55.45 30.51 12.83 1.21 Nervous system diseases 46.65 43.10 9.62 0.63 Nutritional and metabolic disease 54.10 24.46 19.88 1.57 Otorhinolaryngologic disease 45.00 45.00 10.00 0.00 Parasitic diseases 0.00 100.00 0.00 0.00 Pathological conditions, signs and symptoms 55.17 22.13 20.40 2.30 Physical examination 54.50 22.97 19.22 3.31 Physical phenomena and processes 58.21 12.69 27.61 1.49 Respiratory tract diseases 46.95 45.22 7.83 0.00 Skin and connective tissue diseases 41.06 52.21 6.37 0.36 Stomatognathic diseases 40.00 45.71 14.29 0.00 Virus diseases 80.00 20.00 0.00 0.00 Others 54.27 36.11 8.53 1.09 View Large Location deviation of SNPs among data resources The results of the pairwise comparisons among HapMap, dbSNP and the 1000 Genomes Project were sorted by deviation degree. We then constructed a 100% stacked column chart of the number of SNP location deviations between any pair of databases (Figure 4). We found that many (36.36%) of the SNPs shared between HapMap and dbSNP had location deviations >1 Mb, as did many of the SNPs shared between HapMap and the 1000 Genomes Project (Figure 4C and A). These discrepancies may have been due to the different release of the reference genome used by HapMap. However, even though dbSNP and the 1000 Genomes Project use the same version of the reference genome, many SNP location deviations were still identified between these databases (Figure 4B). Figure 4 View largeDownload slide Location deviations of SNP among dbSNP, 1000 Genomes Project and HapMap. A. Location comparison between HapMap and 1000 Genomes Project; B. Location comparison between dbSNP and 1000 Genomes Project; C. Location comparison between dbSNP and HapMap. Four squares with different colors in each core represent the location deviation of SNP between diverse resources. The bar of inside circle shows the proportion of location deviation on each chromosome. The external circle of each subgraph shows 24 chromosomes. Figure 4 View largeDownload slide Location deviations of SNP among dbSNP, 1000 Genomes Project and HapMap. A. Location comparison between HapMap and 1000 Genomes Project; B. Location comparison between dbSNP and 1000 Genomes Project; C. Location comparison between dbSNP and HapMap. Four squares with different colors in each core represent the location deviation of SNP between diverse resources. The bar of inside circle shows the proportion of location deviation on each chromosome. The external circle of each subgraph shows 24 chromosomes. The hypergeometric cumulative distribution test showed that, in general, the MF SNPs were overrepresented among the SNPs with low MAF (MAF < 0.01; P < 3.45 × 10−6). This indicated that lower frequency SNPs were more likely to be false-positive errors. This may be because lower frequency SNPs/SNVs in a given population are sometimes presented as singletons, resulting in insufficient research and an absence of double-hit experimental confirmation. Discussion The accurate location of genomic functional elements, especially across multiple element types, is important for biomedical research [39]. In this study, inconsistent location information was identified by the realignment of functional element sequences from various databases to the reference genome. For gene sequences from Ensembl, the overall location accuracy was 85.34%. Notably, protein-coding gene location consistency was 92.86%, probably due to the depth and breadth of the available global research on coding genes. Nevertheless, the observed 7.14% of coding genes with inconsistent locations was symptomatic of the potential problems across mainstream databases. The location accuracy of miRNAs and pre-miRNAs in miRBase was about 94.51% and 95.22%, respectively. This high accuracy might be due to the short length of these sequences, as well as their relative rarity. Compared to other RNA types, miRNAs have received much research attention, which also has improved location reliability. Sustained attention to fewer than 2000 miRNAs has led to a relatively full understanding, which has improved reliability. As the average lengths of miRNAs and pre-miRNAs are only 21 bp and 80 bp, respectively, these sequences can be easily mapped onto the genome. However, random alignment errors may also be produced. Only 85.81% of the lncRNAs from ENCODE were classified PMs, far less than the percentages of PM protein-coding genes and pre-mRNAs. This difference might be explained by the rapid increase in the numbers of lncRNAs recognized in recent years due to next-generation sequencing (NGS). In addition, effective classification, nomenclature and sequence submission criteria are lacking for lncRNAs, obviously impacting data standardization. Here, we obtained reliable locations for some lncRNAs using sequence alignments and mapping. These locations might be useful for lncRNA functional analyses and biological annotation. The location accuracy for SNPs from dbSNP was only about 75.20%. In addition, relatively few SNPs were PMs. These location problems might have been due to the low frequency of some SNPs in the human population. Therefore, we performed a hypergeometric cumulative distribution test on the MF and low-frequency SNPs in HapMap. We found that MF SNPs are overrepresented among the low-frequency SNPs (MAF < 0.01 and P < 3.45 × 10−6), suggesting that SNP frequency affects alignment-derived locations. dbSNP is a comprehensive database of variation, which accepts data from public projects and private research organizations. More recently, HapMap and the 1000 Genomes Project have been developed. Most groups submitting to these databases use NGS, which generates many SNPs and SNVs [40]. However, the diverse methods of data submission used by the various groups have given rise to many data consistency problems, such as a lack of data validation and low-quality data. When comparing and analyzing data from different sources, we identified a certain degree of deviation both between GWAS disease-risk SNPs and their linked genes and between SNPs and their associated lncRNAs. We also found data source differences in the genomic locations of the same types of functional DNA elements. Some of these deviations were caused by differences in the release version of the reference genomes used by the databases when locating functional elements. However, even when the reference genome version was the same, relatively minor deviations were still observed, due to differences in other important steps, such as data preprocessing, alignment strategy and parameter setting. In our study, we download reference genome sequences from UCSC. After fragment alignment using Bowtie2, the genome location of each element was calculated. We categorized 222 genes, 103 lncRNAs and 58 PCGs as LDs and categorized 207 genes, 101 lncRNAs and 55 PCGs as PMs. This difference might be due to the constant updating of the database. We also identified 86 genes, 16 lncRNAs and 14 PCGs at different locations on the same chromosome and strand, consistent with our previous results (e.g. ENSG00000273610: chr1, +, 21 987 481–21 987 777 and chr1, +, 22 010 650–22 010 946). We also identified extra matching positions during the mapping process (e.g. ENSG00000268993: chr1, −, 121 142 051–121 142 438; chr1, −, 206 160 889–206 161 276; and chr1, −, 143 929 995–143 930 382). The LD gene, PCG and lncRNA sequences were compared to the reference genome using BLAST. BLAST showed that 105 genes, 51 lncRNAs and 34 PCGs were PMs, while 86 genes, 9 lncRNAs and 9 PCGs had different locations on the same chromosome and strand. These latter genes, lncRNAs and PCGs were also identified by our Bowtie2 analysis. However, since Bowtie2 identified an additional seven lncRNAs and five PCGs, Bowtie2 might be more suitable for mapping short fragments to reference genomes. In our study, we found that many participants in an SNP–gene linkage relationship in GWAS are disturbed by gene and SNP location deviations in the dbGaP database. To universalize our study, we used the GWAS catalog database to perform a further study, by comparing the mapping results from the dbGaP and GWAS catalog. In this process, we mainly focused on prostatic neoplasms, breast neoplasms, colorectal neoplasms, lung neoplasms, ovarian neoplasms and stomach neoplasms, which have been identified as SNPs with the most risk. In the dbGaP database, we found 460 SNPs and 383 genes, while in the GWAS catalog database, there were 581 SNPs and 594 genes. For the mapping results, we found that there were 18 and 19 genes that couldn't match the accuracy position for the dbGaP and GWAS catalog, respectively. The results showed that there were some errors in gene mapping due to gene location deviation in both the risk SNP genotype and phenotype databases. By comparing with the result of the UCSC and Ensembl gene location analysis, we infer that some of the information stored in the database has been revised with the update of data release, while information has not been updated in time for other databases. As described above, for the six diseases (prostatic neoplasms, breast neoplasms, colorectal neoplasms, lung neoplasms, ovarian neoplasms and stomach neoplasms), we performed the SNP and gene alignment, from the dbGap and GWAS catalog, and found many mismatches between risk SNPs and their capture genes (Supplementary Table S1). For example, the original location of MIR4752 (ENSG00000264703) in Ensembl is on chromosome 19, +, 54 282 109–54 282 180, but our alignments show that its real location is chromosome 19, +, 54 629 392–54 629 463, and the horizontal shift is 347 283 bp. In the prostatic neoplasm's genome-wide association study, the identified risk SNP rs103294 (chromosome 19, 54 293 995) is adjacent to the original location of MIR4752 (with the distance of 11 815 bp). However, the re-alignment location of MIR4752 is obviously beyond (335 397 bp) the scope of linkage disequilibrium to rs103294. This means that the candidate for prostate cancer risk microRNA mir-4752 may not be the real risk factor, based on the theory of gene association study. The similar samples also exist in breast neoplasms, colorectal neoplasms and ovarian neoplasms. Our results suggest that the functional DNA element sequence and location information deviates among public databases, reminding researchers to be careful when using cross-database sources. In particular, sequences should be realigned first, especially in element location-based studies. Key Points Relatively unneglectable discrepancies/deviations are noted when performing a comprehensive analysis of location alignments in functional DNA elements across several sequence repositories. A degree of up to 20% location deviations is detected in different scales of kilobase- to megabase-pair location alignment analysis. The GWAS candidate gene mapping is affected by SNP location deviations in a large part of disease and phenotype studies. A sequence alignment work strongly suggested before disease phenotype studies, especially for element location-based studies. Authors' contributions H.Z., X.Z. and H.W. carried out the bioinformatics analysis and drafted the manuscript. Y.D., X.L., G.Z. and J.Y. performed the bioinformatics analysis, drafted the manuscript and participated in the manuscript revision process. L.W., H.Z., Y.B., J.L. and J.W. plotted the main figures in the context. H.W., Y.J. and L.X. presented this idea, designed the experiment and guided the whole research process. All authors read and approved the final manuscript. Acknowledgements The author wish to thank all members and the support from the Training Center for Students Innovation and Entrepreneurship Education, Harbin Medical University, Harbin 150081, China. Funding This work was supported by the National Natural Science Foundation of China (grant numbers 31801098, 91746113 and 31501062), the Research Projects of Education Department of Heilongjiang Province (grant numbers 1254G041 and 12511273), the Research Project of Health Department of Heilongjiang Province (grant number 2011-249), the Harbin Science and Technology Bureau project (grant number RC2013QN004112), the Fundamental Research Funds for the Provincial Universities (grant number 2017JCZX50) and the Internal Fund Project of Eye Hospital of Wenzhou Medical University (grant number YJGG20181001). The College of Bioinformatics Science and Technology, of Harbin Medical University, is one of the largest computational biology research organizations in China, which has made valuable achievements in bioinformatics research of complex diseases. The School of Ophthalmology & Optometry and Eye Hospital, School of Biomedical Engineering of Wenzhou Medical University, is one of the most famous ophthalmic medical organizations in China, which has achieved a high level of research in ophthalmic disease mechanism and biomedical big data. Hewei Zheng is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. His main research interests are RNA molecular genetics and structural bioinformatics. Xueying Zhao is a research assistant at the Institute of Hematology and Blood Diseases Hospital, Chinese Academy of Medical Sciences. Her main research interests are RNA molecular genetics and structural bioinformatics. Hong Wang is an assistant professor at Harbin Medical University. Her research interests include complex disease bioinformatics and molecular genetics. Yu Ding is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. His main research interests are RNA and protein structural biology. Xiaoyan Lu is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. Her main research interests are RNA molecular genetics and structural bioinformatics. Guosi Zhang a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. His main research interests are RNA molecular genetics and structural bioinformatics. Jiaxin Yang is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. Her main research interests are microbiological genomics and computational biology. Lianzong Wang is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. His main research interests are lncRNA regulatory bioinformatics and system biology. Haotian Zhang is an undergraduate student jointly trained by Harbin Medical University and Wenzhou Medical University. His main research interests include bioinformatics algorithm and structural bioinformatics. Yu Bai is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. Her main research interests are RNA molecular genetics and structural bioinformatics. Jing Li is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. Her main research interests are RNA expression and molecular genetics. Jingqi Wu is a graduate student jointly trained by Harbin Medical University and Wenzhou Medical University. Her main research interests are microbiology and bioinformatics. Yongshuai Jiang is an associate professor at Harbin Medical University. His main research interests include complex disease bioinformatics and statistical genetics. Liangde Xu is an associate professor at Eye Hospital of Wenzhou Medical University and Harbin Medical University. His main research interests include complex disease bioinformatics and molecular genetics. References 1. Mardis ER . The impact of next-generation sequencing technology on genetics . Trends Genet 2008 ; 24 : 133 – 41 . Google Scholar Crossref Search ADS PubMed WorldCat 2. sequencing MV N-g . The genome jigsaw . Nature 2013 ; 501 : 263 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 3. Green ED , Watson JD , Collins FS . Human Genome Project: twenty-five years of big biology . Nature 2015 ; 526 : 29 – 31 . Google Scholar Crossref Search ADS PubMed WorldCat 4. Zhou M , Hu L , Zhang Z , et al. Recurrence-associated long non-coding RNA signature for determining the risk of recurrence in patients with colon cancer . Mol Ther Nucleic Acids 2018 ; 12 : 518 – 29 . Google Scholar Crossref Search ADS PubMed WorldCat 5. Couzin J . Human genome. HapMap launched with pledges of $100 million . Science 2002 ; 298 : 941 – 2 . Google Scholar Crossref Search ADS PubMed WorldCat 6. International HapMap C . The International HapMap Project . Nature 2003 ; 426 : 789 – 96 . Crossref Search ADS PubMed WorldCat 7. Siva N . 1000 Genomes Project . Nat Biotechnol 2008 ; 26 : 256 . Google Scholar Crossref Search ADS PubMed WorldCat 8. Kuehn BM . 1000 Genomes Project finds substantial genetic variation among populations . JAMA 2012 ; 308 : 2322, 2325 . Google Scholar PubMed WorldCat 9. Kim HS , Minna JD , White MA . GWAS meets TCGA to illuminate mechanisms of cancer predisposition . Cell 2013 ; 152 : 387 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 10. Zhou M , Zhao H , Wang X , et al. Analysis of long noncoding RNAs highlights region-specific altered expression patterns and diagnostic roles in Alzheimer's disease . Brief Bioinform 2018 . WorldCat 11. Tomczak K , Czerwinska P , Wiznerowicz M . The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge . Contemp Oncol (Pozn) 2015 ; 19 : A68 – 77 . Google Scholar PubMed WorldCat 12. Ding Y , Wang H , Zheng H , et al. Evaluation of drug efficacy based on the spatial position comparison of drug-target interaction centers . Brief Bioinform 2019 . WorldCat 13. Pandey A , Lewitter F . Nucleotide sequence databases: a gold mine for biologists . Trends Biochem Sci 1999 ; 24 : 276 – 80 . Google Scholar Crossref Search ADS PubMed WorldCat 14. O'Rawe JA , Ferson S , Lyon GJ . Accounting for uncertainty in DNA sequencing data . Trends Genet 2015 ; 31 : 61 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 15. Burks C , Cinkosky MJ , Fischer WM , et al. GenBank . Nucleic Acids Res 1992 ; 20 ( Suppl ): 2065 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 16. Griffiths-Jones S , Grocock RJ , van Dongen S , et al. miRBase: microRNA sequences, targets and gene nomenclature . Nucleic Acids Res 2006 ; 34 : D140 – 4 . Google Scholar Crossref Search ADS PubMed WorldCat 17. Consortium EP . The ENCODE (ENCyclopedia Of DNA Elements) Project . Science 2004 ; 306 : 636 – 40 . Crossref Search ADS PubMed WorldCat 18. Smigielski EM , Sirotkin K , Ward M , et al. dbSNP: a database of single nucleotide polymorphisms . Nucleic Acids Res 2000 ; 28 : 352 – 5 . Google Scholar Crossref Search ADS PubMed WorldCat 19. Hubbard T , Barker D , Birney E , et al. The Ensembl genome database project . Nucleic Acids Res 2002 ; 30 : 38 – 41 . Google Scholar Crossref Search ADS PubMed WorldCat 20. Karolchik D , Baertsch R , Diekhans M , et al. The UCSC Genome Browser Database . Nucleic Acids Res 2003 ; 31 : 51 – 4 . Google Scholar Crossref Search ADS PubMed WorldCat 21. Bernstein BE , Kellis M . Large-scale discovery and validation of functional elements in the human genome . Genome Biol 2005 ; 6 : 312 . Google Scholar Crossref Search ADS PubMed WorldCat 22. Yang C , Sun C , Liang X , et al. Integrative analysis of microRNA and mRNA expression profiles in non-small-cell lung cancer . Cancer Gene Ther 2016 ; 23 : 90 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 23. Arlt A , Sebens S , Krebs S , et al. Inhibition of the Nrf2 transcription factor by the alkaloid trigonelline renders pancreatic cancer cells more susceptible to apoptosis through decreased proteasomal gene expression and proteasome activity . Oncogene 2013 ; 32 : 4825 – 35 . Google Scholar Crossref Search ADS PubMed WorldCat 24. Zhou M , Diao Z , Yue X , et al. Construction and analysis of dysregulated lncRNA-associated ceRNA network identified novel lncRNA biomarkers for early diagnosis of human pancreatic cancer . Oncotarget 2016 ; 7 : 56383 – 94 . Google Scholar PubMed WorldCat 25. Lam ET , Bracci PM , Holly EA , et al. Mitochondrial DNA sequence variation and risk of pancreatic cancer . Cancer Res 2012 ; 72 : 686 – 95 . Google Scholar Crossref Search ADS PubMed WorldCat 26. Kaiser JBIOMEDICINE . NIH opens precision medicine study to nation . Science 2015 ; 349 : 1433 . Google Scholar Crossref Search ADS PubMed WorldCat 27. Aronson SJ , Rehm HL . Building the foundation for genomics in precision medicine . Nature 2015 ; 526 : 336 – 42 . Google Scholar Crossref Search ADS PubMed WorldCat 28. Wang H , Lu X , Chen F , et al. Landscape of SNPs-mediated lncRNA structural variations and their implication in human complex diseases . Brief Bioinform 2018 . WorldCat 29. Wang H , Zheng H , Wang C , et al. Insight into HOTAIR structural features and functions as landing pads for transcription regulation proteins . Biochem Biophys Res Commun 2017 ; 485 : 679 – 85 . Google Scholar Crossref Search ADS PubMed WorldCat 30. Wang C , Wang L , Ding Y , et al. LncRNA structural characteristics in epigenetic regulation . Int J Mol Sci 2017 ; 18 . WorldCat 31. Guo Y , Dai Y , Yu H , et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis . Genomics 2017 ; 109 : 83 – 90 . Google Scholar Crossref Search ADS PubMed WorldCat 32. Birney E , Andrews TD , Bevan P , et al. An overview of Ensembl . Genome Res 2004 ; 14 : 925 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 33. Langmead B . Aligning short sequencing reads with Bowtie . Curr Protoc Bioinformatics , Chapter 11:Unit 2010 ; 11 : 17 . WorldCat 34. Langmead B , Salzberg SL . Fast gapped-read alignment with Bowtie 2 . Nat Methods 2012 ; 9 : 357 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 35. Tryka KA , Hao L , Sturcke A , et al. NCBI's database of Genotypes and Phenotypes: dbGaP . Nucleic Acids Res 2014 ; 42 : D975 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 36. Mazza T , Mazzoccoli G , Fusilli C , et al. Multifaceted enrichment analysis of RNA-RNA crosstalk reveals cooperating micro-societies in human colorectal cancer . Nucleic Acids Res 2016 ; 44 : 4025 – 36 . Google Scholar Crossref Search ADS PubMed WorldCat 37. Vossen DM , Verhagen CVM , Grenman R , et al. Role of variant allele fraction and rare SNP filtering to improve cellular DNA repair endpoint association . PLoS One 2018 ; 13 :e0206632. WorldCat 38. Chen SF , Chao YL , Shen YC , et al. Resequencing and association study of the NFKB activating protein-like gene (NKAPL) in schizophrenia . Schizophr Res 2014 ; 157 : 169 – 74 . Google Scholar Crossref Search ADS PubMed WorldCat 39. Lai WK , Buck MJ . ArchAlign: coordinate-free chromatin alignment reveals novel architectures . Genome Biol 2010 ; 11 : R126 . Google Scholar Crossref Search ADS PubMed WorldCat 40. de Vries PS , Sabater-Lleal M , Chasman DI , et al. Comparison of HapMap and 1000 genomes reference panels in a large-scale genome-wide association study . PLoS One 2017 ; 12 :e0167742. WorldCat Author notes Hewei Zheng, Xueying Zhao, and Hong Wang contributed equally to this work. © The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)