Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset

doi:10.1371/journal.pcbi.1003382

The 1000 Genomes Project data provide a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability very much higher than that of other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.

Citation: de Beer TAP, Laskowski RA, Parks SL, Sipos B, Goldman N, et al. (2013) Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset. PLoS Comput Biol 9(12): e1003382.
doi:10.1371/journal.pcbi.1003382

Editor: Yana Bromberg, Rutgers University, United States of America

Received April 29, 2013; Accepted October 22, 2013; Published December 12, 2013

Copyright: © 2013 de Beer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported in part by the National Institutes of Health grant GM094585, by the U.S. Department of Energy, Office of Biological and Environmental Research, under contract DE-AC02-06CH11357 (Midwest Center for Structural Genomics) as well as EMBL-EBI. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

PLOS Computational Biology | www.ploscompbiol.org 1 December 2013 | Volume 9 | Issue 12 | e1003382

Author Summary

In this paper we compare the differences between ‘natural’ and disease-associated amino acid variants at both sequence as well as structural levels. We used data from the 1000 Genomes Project (1 kG), the OMIM database and UniProtKB Humsavar. The results highlight the complex interplay of features from the level of the DNA up to protein sequence and structure. The codon CpG dinucleotide content plays a large role in determining which amino acids mutate. This in turn affects the mutability of amino acids and a clear difference was seen between non-disease and disease variants where amino acids that are naturally very mutable show the opposite trend in the disease-associated data. The current results show evidence for some selection, mainly in that the variants occur slightly more often on the surface of the protein and are much less likely to be annotated as functional than expected by chance. However we should note that even the best definition of functional, taken from structural data, is limited. Even with these caveats, it is clear that the 1 kG variants eschew functional residues as defined here, a trend which is surprisingly even stronger in the OMIM data.

Introduction

With the release of the 1000 Genomes Project (1 kG) data [1], it has become feasible to study human protein variation on a large scale. The main aim of the 1 kG project was to discover and characterize at least 95% of human DNA variants (with a frequency of occurrence of >1%) found in multiple human populations across the world. Five main populations were sampled with ancestry in Europe, West Africa, the Americas, East Asia and South Asia. The project has provided a rich set of synonymous (sSNPs) and non-synonymous (nsSNPs) variants for 1092 individuals from diverse populations. It is estimated from the 1 kG data that each individual will, on average, differ from the reference human genome sequence at 10,000–12,000 synonymous sites in addition to 10,000–11,000 non-synonymous sites [1]. As these nsSNPs change the amino acid sequence of the protein, the changes have the potential to affect the structure and function of the corresponding proteins. The 1000 Genomes Project data set is valuable in that it is large and not derived from a disease cohort but rather seeks to capture variants found in a disparate set of healthy individuals. This can be used to characterise differences on average between disease-associated and benign mutations (or at least mutations not known to be associated with disease) as well as exploring their structural characteristics and preferences. The reports from the 1000 Genomes Consortium [1,2] have focused on genome and nucleotide variation, and other papers consider mutations in association with a specific disease (e.g. cancer) [3].

Various databases such as the Online database of Mendelian Inheritance in Man (OMIM, [4]), the UniProtKB human polymorphism set (Humsavar, [5]) and the Human Gene Mutation Database (HGMD, [6]) collect information on inherited diseases associated with variants. The Humsavar database contains disease-associated variants from the literature and OMIM. OMIM currently contains information on approximately 10,200 nsSNPs associated with diseases (December 2011) and Humsavar about 23,500 disease-associated nsSNPs. Most of the phenotypical effects and their molecular origins are not well established, so predicting the functional effect of a single amino acid variant is of great medical interest. The main methods assume that mutations in highly conserved residues cause disease and thus, by using alignments to homologous sequences and residue similarity, the severity of the variant can be gauged. More advanced methods include information derived from protein structures (such as solvent accessibility, free energy changes, environment specific substitution tables and functional annotations) to improve the accuracy (see review by [7]). The advantage of using a 3D approach for prediction is that the consequence and characteristics of the variant can be studied in its specific environment in the protein. This provides a level of information beyond a sequence or a sequence alignment [8]. If there are ligands present, the interaction between the mutated amino acid and the ligand can be studied. This has been successfully applied to various individual proteins on a case-by-case basis [9,10]. In total over 30 different programs to predict the effects of these variants have been published, including Condel [11], SNAP [12], SDM [13], PolyPhen [14], VEP [15], SIFT [16,17] and SNP&GO [18]. Most of these algorithms can only predict whether a specific variant will be neutral or deleterious for the protein with various degrees of accuracy, although measuring accuracy is challenging in the absence of a good benchmark.

To allow the accurate prediction of functional effects of SNPs, we need a thorough understanding of why amino acids mutate in humans. Various groups have worked on the effect of the mutations and numerous studies have been done on small specific sets of proteins [8,19–22]. Blundell and co-workers have found that the local environment around an amino acid plays a large role in the effect that selection has on a mutation in a specific position [21]. This has led to the development of environment specific substitution matrices [23,24] that incorporate structural constraints. Subramanian and Kumar [25] did a detailed analysis on a set of 8,627 disease-associated mutations and found that disease-associated mutations tend to occur on inter-species conserved residues. The common factor between these studies is that they try to understand the effect that selection and structural constraints have on disease vs non-disease states in selected sets of proteins. Very few studies have tried to unravel the underlying cause for mutation patterns seen in human proteins. With this work we aim to elucidate why certain amino acids mutate more and try to understand the underlying mechanisms present in the mutation process. We gather the data for all the amino acid mutations found in the 1000 Genomes Project to characterise their sequence and structural properties, providing a benchmark background against which to compare the disease-associated nsSNPs in OMIM and Humsavar.

Results

The 1000 Genomes Project data were queried to retrieve all the nsSNPs, which were filtered to include only those that occurred in a single population (see methods). This ensures that only the more recent mutational events in human evolution are included and simplifies counting. In addition variants at a single site were only counted once even if they occur in multiple individuals, since such clusters are assumed to represent a single variation event that has been inherited in the other individuals. For 3D analysis only human proteins, for which complete structures are available, were included to ensure accurate analysis of 3D features. For solvent accessibility calculations, a monomer subset was also generated to avoid problems with uncertain multimeric states and validate our findings on the larger dataset. Homology models based on close relatives were used to extend the data set and see if the trends observed in the experimental structures were preserved. Table 1 summarizes the five data sets created and used in this study.

Table 1. The different datasets constructed and used in this study and their composition.

Data set | Protein chains | nsSNPs | Description
1 kG | 19,058 | 106,311 | A data set containing all the 1 kG variants filtered by population.
OMIM | 19,058 | 10,151 | A protein sequence based set containing OMIM variants for all reviewed UniProt human proteins.
Humsavar | 19,058 | 23,846 | A set based on human disease polymorphisms from UniProt.
3D | 2,139 | 10,628 | A protein 3D structure based set consisting of 1 kG variants for proteins that have a complete structure in the PDB.
Monomer | 325 | 1,461 | A subset of the 3D set containing only proteins identified as being monomeric.
Model | 2,630 | 13,037 | A set based on human ModBase homology models where sequence coverage and identity are between 90–100%.

doi:10.1371/journal.pcbi.1003382.t001

The amino acid exchange matrix derived from the 1000 Genomes Project dataset

Figure 1 shows the amino acid exchange matrix generated from the ~106,000 nsSNPs found in the 1 kG data. Amino acid mutations requiring two or three base changes are not defined in this dataset due to technical reasons. The 1 kG matrix exhibits several interesting features, most of which reflect the genetic code and the differential mutability of various codons. All possible single base changes are observed. The matrix is not symmetrical as a result of the differences in frequency of occurrence of amino acids as well as differences in their mutabilities [26,27]. As expected there is a strong correlation (r = 0.786) between the frequency of occurrence of amino acids in the human proteome and the number of associated codons. Figure 2 shows that, excluding Arg and Leu which are extreme outliers, there is a strong trend for amino acids with a higher frequency of occurrence to have more mutations (r = 0.836). Taken together this leads to a relatively strong correlation (r = 0.741) between the number of codons and the number of mutations. In contrast, the frequency of the gained amino acids, resulting from the mutation, shows little correlation between frequency of occurrence and number of mutations (r = 0.349).

Figure 1. The amino acid exchanges observed in human protein variants. The 1 kG data set is the top row of each cell and OMIM the bottom row of each cell*. Amino acids are arranged by 1 letter code according to increasing hydrophobicity (least hydrophobic is left and most hydrophobic is right) using the Fauchère and Pliska scale [58]. Yellow blocks indicate mutations where there are statistically significant differences between 1 kG and OMIM. Blue blocks indicate where no mutations were present in the 1 kG data set. White blocks show where there are no statistically significant differences. Green blocks show where there are proportionally more 1 kG mutations compared to OMIM. Orange blocks show where there are proportionally more OMIM mutations than 1 kG. The mutability scores (see methods) for the 1 kG and OMIM sets are shown in the last column. Note that these matrices are fundamentally different. The 1 kG data set gathers all the observed mutations in the 1 kG project, counting each only once; the OMIM data set combines information gathered from potentially many individuals but filtered to identify those mutations associated with a disease. doi:10.1371/journal.pcbi.1003382.g001

Figure 2. Comparison of the number of mutating residues vs the amino acid frequency of occurrence. doi:10.1371/journal.pcbi.1003382.g002

Amino acid mutabilities

The mutabilities of the amino acids (see methods) in the 1 kG dataset are shown in the last column of Figure 1. Arg (0.031) is the most mutable, whilst the more chemically complex amino acids, Trp (0.004) and Phe (0.005), have the lowest mutabilities. There is no correlation in the 1000 Genomes data between mutability and frequency of occurrence (r = −0.003 excluding Arg) nor between mutability and the number of codons (Figure 3). It is well known that CpG dinucleotides in DNA tend to mutate at rates 10–50 times higher than other dinucleotides [28,29] and thus amino acids with a CpG present in their codons will mutate with a higher probability (see Figure 4). Four out of the six codons for Arg include CpG sequences, and Arg mutates more frequently than any other residue, with a mutability (0.031) which is over twice as high as its nearest rival. This high mutability also reflects the fact that the CpGs in the Arg codons occur in the non-wobble positions so nucleotide mutations give rise to non-synonymous SNPs. In contrast Leu, which also has six codons, none of which contain CpG, has a low mutability (0.005) and mutates six times less frequently than Arg. However the correlation with CpG is far from perfect and other factors must have an effect. For example, Met, which has only one codon with no CpG dinucleotide, is the second most mutable amino acid (0.014).

Figure 3. Amino acid mutability vs the number of codons in the 1 kG data. doi:10.1371/journal.pcbi.1003382.g003

Figure 4 shows the clear pattern of amino acid gain and loss in the human proteome. Jordan [26] and Zuckerkandl [30] long since identified that Cys, Met, His, Ser and Phe are being accrued significantly in the human proteome. Our data confirm a net gain of these five amino acids, and Val, Asn, Ile and Thr were also confirmed as weak gainers. Jordan and co-workers also identified strong losers and our data again confirm that Pro, Ala, Gly and Glu are strong losers. Lys was identified as a weak loser but our larger dataset suggests that lysine should be considered a weak gainer in humans. Arg is the strongest loser in the human genome (similar to the human set in [26] but not other considered species).

Figure 4. A visual representation of the asymmetry of the 1 kG data. The plot shows the difference between how often an amino acid mutates vs how often it is mutated to. These are raw counts and also reflect the frequency of occurrence. Each amino acid is coloured according to CpG content. Red: a CpG dinucleotide occurs in its codons; yellow: if one of its codons starts with a G (with a C possibly preceding it); blue: no CpG in its codons. The black line indicates the diagonal where ‘mutations to’ equals ‘mutations from’. doi:10.1371/journal.pcbi.1003382.g004

We calculated the mutability for every amino acid on a population specific basis. None of the populations showed a different pattern of amino acid mutabilities, compared to the overall trend, with correlation coefficients equal to 1.0 (Figure S1).

Using the individual amino acid mutabilities, we looked at aggregate protein mutability differences by adding up the individual mutabilities for every amino acid in each protein in the data set and normalising by protein length. This was compared to the aggregate mutabilities of proteins involved in disease as classified by OMIM and Humsavar. The average score for disease-associated proteins was 0.0103 and for non-disease proteins 0.0102, with a median of 0.01022 (σ = 0.0006) and 0.01018 (σ = 0.0005), respectively, indicating that protein aggregate mutability has no bearing on disease-association (Figure S2).

The effects of physicochemical characteristics of the amino acids on their mutability

As well as constraints on the mutational process at the DNA level, the consequence of a variant on the protein structure and function will also have an impact on the number of observed mutations. If a variant interferes with the structure and function of a protein and that protein is essential, then this variant is less likely to be seen. However comparison of mutability with the size and hydrophobicity of the amino acid shows very little correlation in the 1 kG dataset. There is a moderate anti-correlation between higher mutability and size (r = −0.474), with the smaller amino acids mutating more frequently, but no correlation at all between mutability and hydrophobicity (r = −0.082), although the large hydrophobic amino acids (Leu, Phe and Trp) have the lowest mutability scores. Trp has the fewest mutations (544, even though all SNPs in Trp codons result in a change of amino acid) and also the lowest mutability score (0.004) together with Phe. In addition to their complexity and low abundance, Phe and Trp often occur in specialized roles such as the interior of proteins, π-π stacking or ring interactions and this might add to their low mutability. The mutability of Cys is also low, perhaps reflecting its role in disulphide bridges, which help to stabilise extracellular proteins.
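Two of the calculations described in this section are simple enough to sketch in code: counting how many codons of each amino acid contain a CpG dinucleotide, and computing the length-normalised aggregate mutability of a protein. The Python sketch below is illustrative only and is not the authors' pipeline. The function names `cpg_codon_counts` and `aggregate_mutability` are hypothetical, the codon table is the standard genetic code, and the mutability dictionary holds only the five values quoted in the text (the mutability definition itself is in the methods, which are not reproduced here). Note also that only CpGs inside a codon are counted, whereas a CpG can also span a codon boundary.

```python
# Standard genetic code as DNA codons, bases ordered T, C, A, G.
# '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Per-residue mutabilities quoted in the text. The other 15 amino acids
# are not given in this section, so they are deliberately left out.
REPORTED_MUTABILITY = {"R": 0.031, "M": 0.014, "L": 0.005, "F": 0.005, "W": 0.004}


def cpg_codon_counts():
    """Map each amino acid to (codons containing 'CG', total codons)."""
    counts = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons do not encode an amino acid
        with_cpg, total = counts.get(aa, (0, 0))
        counts[aa] = (with_cpg + ("CG" in codon), total + 1)
    return counts


def aggregate_mutability(sequence, mutability):
    """Sum per-residue mutabilities and normalise by protein length."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(mutability[aa] for aa in sequence) / len(sequence)


if __name__ == "__main__":
    counts = cpg_codon_counts()
    for aa in "RLM":
        print(aa, counts[aa])
    # Toy five-residue peptide built only from residues with quoted values.
    print(aggregate_mutability("MRLWF", REPORTED_MUTABILITY))
```

On the standard code this reproduces the counts stated above: four of the six Arg codons contain a CpG, while none of the six Leu codons and not the single Met codon do. The peptide `MRLWF` is a made-up example, chosen only because all of its residues have a mutability quoted in the text.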
The structural properties of 1000 Genomes variants

To investigate the structural characteristics of these variants, three sets of protein structures were compiled, namely the 3D set, the monomer set and the model set (Table 1). The 3D and monomer sets were constructed from data in the PDB (see methods), while the model set was created, and the subsequent variant modelling performed, using ModBase [31] and Modeller [32], built into an in-house homology modelling pipeline. The 3D set contains 2,139 protein chains. A total of 10,628 1 kG nsSNPs were found in these chains, of which 5,524 could be modelled on the known structures of human proteins. The monomer set contains 325 protein chains identified as monomers and a total of 1,461 1 kG nsSNPs were found, of which 897 could be modelled. The model set, including models based on homologues from the PDB, contained 2,630 protein chains and 12,432 out of 13,037 nsSNPs could be modelled. For the Humsavar set we found 5,592 nsSNPs of which 3,942 could be modelled.

Figure 5A shows a comparison of the solvent accessibility distribution for all residues compared to that for the variants. On average the variants in the 1 kG are slightly more exposed. An analysis of the solvent exposed residues found that, for the most accurate monomer set, 79% of nsSNPs are solvent exposed compared to 73% of all residues (p = 0.001). For the structures in the model set, 81.9% of nsSNPs were solvent exposed. For all three datasets, the 1 kG variants have a slight preference to occur on the surface of proteins compared to all residues. Figure 5B shows that there were no appreciable differences in secondary structure preferences between variants and other residues.

Figure 5. Site properties for all residues, 1 kG nsSNPs, OMIM nsSNPs and Humsavar nsSNPs in the structure 3D set. (A) the solvent accessibility for the variants in the four datasets, (B) the secondary structure in which each of the variants occurs, (C) the functional annotation of every variant in the four datasets. doi:10.1371/journal.pcbi.1003382.g005

Do natural mutations occur in functionally annotated residues?

Functional annotation for each human protein was derived using SAS (Sequence Annotated by Structure, [33]). Table 2 shows the different functional annotations for each set. The vast majority of functional annotations identified make contacts to ligands (using PDBsum data, [34]) or site interactions in the proteins (as defined in the PDB). Only 15.5% of the mutations (1,648 of 10,628) in the 3D set were annotated with a function compared to 29.1% of all residues in the set of human structures (Figure 5C). These data show that the observed mutations in the 1000 Genomes occur less frequently in the functionally annotated residues compared to all residues.

Table 2. The various functions assigned to nsSNPs in each set.

Set | Site | Ligand | Site/ligand overlap | Metal | Catalytic | Overall (non-redundant)
3D | 1,414 | 1,432 | 1,220 | 334 | 17 | 1,648 (15.5%)
Monomer | 281 | 273 | 245 | 83 | 4 | 312 (21.4%)
OMIM | 163 | 184 | 147 | 17 | 17 | 209 (2.1%)
Humsavar | 305 | 285 | 252 | 58 | 41 | 355 (51.2%)
Models | 1,538 | 1,443 | 1,304 | 376 | 36 | 1,676 (12.9%)

‘Site’ refers to residue specific annotations made by depositors of PDB structures, ‘Ligand’ refers to residues involved in binding a ligand, ‘Metal’ refers to residues coordinating with metals and ‘Catalytic’ to residues involved in the catalytic activity of the protein. The % of non-redundant assigned residues that are ‘functional’ is also shown. doi:10.1371/journal.pcbi.1003382.t002

Residue conservation

Residue conservation scores, defined as the variation of the residues at a given site in the protein across multiple species, were obtained for all sites in the human proteome (where sufficient data are available) from the Evolutionary Trace server [35]. These scores are distributed across the whole range of conservation (Figure 6) with a mean score of 0.48. The scores for all the sites with mutations in the 1000 Genomes data show a slightly different distribution from all residues, with a small but significant shift (p < 2.2 × 10^−16) towards the less conserved sites and a reduced mean conservation score of 0.43. Clearly natural variation occurs across all conservation levels and is not limited to non-conserved residues.

Figure 6. Comparison of the conservation scores in the four sets used. The density distribution of residue conservation scores for all the amino acid positions in UniProt (9,532,474 residues, black), 1 kG (185,428 residues, blue), OMIM (8,099 residues, red) and Humsavar (21,446 residues, green). The conservation scores range from 0 for non-conserved residues to 1 for highly conserved residues. doi:10.1371/journal.pcbi.1003382.g006

Amino acid exchange characteristics in 1000 Genome data

For each amino acid the mutation profile can be calculated showing the preference for specific X => Y mutations in the 1000 Genomes data. These profiles, given for all the amino acids in Figure 7, show that there are striking differences in frequency of occurrence for the different exchanges. For example, in the 1 kG set Arg shows a strong preference to mutate to Gln and His, whilst mutations to Ser, Gly and Pro are much less frequent. All the amino acids show these differential exchange rates. Figure 8A shows the distribution of changes in energy of the whole protein caused by each mutation, evaluated as the statistical potential energy DOPE score (Discrete Optimised Protein Energy) in Modeller. 68.1% of the 1 kG variants increase the DOPE score (i.e. make the protein less stable). This implies that most natural variants decrease the stability of the protein, albeit by a very small amount. The distributions of changes in size and hydrophobicity for all observed mutations (Figure 8B and 8C) show that 59.4% of the mutations increase the hydrophobicity of the amino acid and 52.4% of mutations increase the size. Over 84% of variants change their size by less than 50 Da. 72% of variants change their hydrophobicity by less than 1 unit. Extreme changes are rare. At this stage these observations provide empirical expectation rates for amino acid exchanges in humans and result from the genetic code, the nucleotide exchange rates and also some selection at the protein level. However without a good random model it is difficult to be confident about the importance of the different contributions to such variation.

Figure 7. Comparison of the differences in observed mutations in the various sets. Comparison of the differences in the % of observed mutations in the 1 kG (blue) and OMIM (red) sets for one amino acid mutating to all others e.g. proportionally, more mutations from Lys to Glu are recorded in OMIM than in the 1 kG set. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid. doi:10.1371/journal.pcbi.1003382.g007

Figure 8. Comparison between the physicochemical properties of the wildtype and the mutant models for each of the data sets. Plots showing the differences between (A) Modeller DOPE scores for the wild type and mutant model (based on 3D, 10,628 mutations, and Humsavar sets, 21,446 residues), (B) changes in hydrophobicity between wild type and mutant in both sets and (C) changes in size between wild type and mutation in both sets. doi:10.1371/journal.pcbi.1003382.g008

Comparison of 1000 Genome variants with those predicted by the PAM and WAG mutation matrices

The 1 kG counts matrix is a snapshot of mutations that have occurred in humans in a short period of time. To understand this process the count matrix can be converted into an instantaneous rate matrix describing the rates of change of each amino acid in humans in a time-independent manner [36]. Instantaneous rate matrices have previously been built from a wide selection of protein alignments across many species including nuclear proteins, mitochondrial proteins, chloroplast proteins, buried protein domains and exposed protein domains. PCA can be used to compare these inter-species matrices with the 1 kG intra-species matrix (Figure 9A–C). The 1 kG matrix was built using data where the direction of the mutations is known whereas all other matrices were calculated assuming direction is unknown. This was compared to the WAG [37] and PAM matrix [38]. To check that any differences between the 1 kG matrix and the other matrices are not caused by using direction, a directionless matrix has also been included in the plot (Figure 9D). In this plot, principal component one clearly separates the 1 kG matrices, which are placed very close together, from all of the previously calculated matrices. Principal component two then spreads matrices out based on whether the alignments used to build them are made up mainly of exposed or buried domains, with the mitochondrial matrices at the one extreme built from nearly all membrane proteins, and matrices built from only exposed regions of proteins at the other.

Figure 9. Bubble plots comparing the relative differences between the instantaneous rate change matrices of the data sets. (A) 1 kG data, (B) PAM matrix and (C) WAG matrix. (D) A PCA (first two components) plot showing the separation of the 1 kG matrices from other matrices. Matrices included are 1 kG (with and without assuming direction), nuclear (WAG, JTT, LG, PAM, tm126, PCMA), mitochondrial (mtREV24, mtMam, mtArt, mtZoa), chloroplast (cpREV, cpREV64), exposed (alpha helix, beta sheet, coil, turn) and buried (alpha helix, beta sheet, coil, turn). Principal components one and two represent 34% and 20% of the variance, respectively. All other principal components represent 9% or less of the variance each. Amino acids are arranged according to increasing hydrophobicity. doi:10.1371/journal.pcbi.1003382.g009

A difference between the intra-species data and the inter-species matrices is the amount of selection which has occurred. Due to the time-scale for the 1 kG data and the relatively weak selection in human populations [39,40] the only mutations which are not observed are lethal mutations. This means that there should be a limited effect of selection on the 1 kG matrix. By using no allele frequency cutoff for the minor alleles when building the count matrix, we gather the maximum amount of information about the mutation process. The counts are necessarily shaped by mutation and selection but will mostly reflect the mutation process. The inter-species matrices (e.g. PAM and WAG in Figure 9B,C) on the other hand are subject to selection pressures. This could explain why the 1 kG matrix is so different from the other matrices. One clear factor is CpG hypermutability: for example, changes from Arg, an amino acid with four of six codons containing a CpG, have a very high rate in the 1 kG data, and not in WAG (Figure 9A,B). In fact only codons containing a CpG have high rates overall (Figure 10). The most plausible explanation is that these CpG mutations are occurring at a very high rate and then are selected out so that the effect is not seen as strongly when looking across multiple species.

Figure 10. Dependence of mutation rates on the change in CpG status. Rates of change from codons were calculated similarly to the amino acid rate matrix [36], but on a 61 by 61 codon matrix. doi:10.1371/journal.pcbi.1003382.g010

Comparison between the 1000 Genomes variants and the disease-associated variants

For comparison, we have constructed the amino acid exchange counts matrix for data from the OMIM database and the associated plots for these mutations (Figures 1–8). Disease variants from the UniProtKB/Swiss-Prot Human polymorphisms and disease mutations index (Humsavar) were also included with plots available in the supplement (Figures S3, S4, S5). Our focus however is on the OMIM set. In contrast to the 1 kG data, various double and triple base mutations are observed in the OMIM set; however, the three triple base changes (Phe-Lys, Met-Tyr and Trp-Ile) were checked back to the publications and all were found to be errors either in the paper or in OMIM and were removed. 82 two base changes were found in OMIM and a few (10%) randomly selected changes were manually checked with no errors found. Clearly the OMIM data are radically different from the 1000 Genome data, in that they are all independent observations of variable confidence and manually determined by individual scientists. They only represent a small fraction of disease-associated nsSNPs and the number of mutations (~10,000) is approximately ten times smaller than the number of 1000 Genomes mutations. The normalised OMIM counts that differ from the 1 kG dataset are coloured in Figure 1. Considering just the residue type, if we exclude Arg, the overall correlation between the normalised frequencies of occurrence of the mutated residues in the two datasets is only 0.14 and between 1 kG and Humsavar it is 0.48. If we compare all 148 observed X => Y frequencies, the correlation between 1 kG and OMIM is 0.51 and 1 kG and Humsavar is 0.79.

Previous studies have found that mutations from Arg and Gly are the major contributors to human genetic disease and have been shown to make up about 30% of the mutations involved in disease [41]. In this updated and much expanded set, variants from Arg and Gly only make up 15% of the disease causing mutations. However mutations to Arg are still the biggest contributor to genetic disease with ~19.4% of all mutations. Figure 11 shows a rank order comparison between the frequency of occurrence of the 1 kG and OMIM variants (r = 0.09) as well as between 1 kG and Humsavar (r = 0.31) and Humsavar and OMIM (r = 0.51), normalised for amino acid occurrence. Unlike for the 1 kG data, the disease-associated variants show moderate inverse correlations between their frequency and the frequency of occurrence of the residue type (r = −0.67) implying that, at least for OMIM, the mutations to the rarer amino acids (with fewer codons) are more likely to be associated with disease. As with the 1 kG data there is no strong correlation between a residue type being associated with a disease in the OMIM data and the number of codons. For hydrophobicity and size, the disease associated variants show the opposite trend to the 1 kG dataset with a moderate correlation between lower frequency and smaller size (r = 0.528, excluding Cys and Trp) but no correlation between frequency and hydrophobicity (r = 0.289). It is interesting to note that the least mutable amino acid in the 1 kG data (Trp) turns out to be the residue whose mutation is most likely to result in disease in the OMIM variants and is highly ranked in the Humsavar set. Trp, the largest amino acid, often occurs in specialized roles in proteins as does Cys, the second most frequent variant residue type in OMIM. Amino acids with a lower frequency of occurrence tend to be the more complex amino acids and are frequently found in specialized roles. Mutating them will result in the possible loss or alteration of protein function, hence the over-representation in OMIM and Humsavar.

In a number of cases the OMIM and 1 kG variant preferences appear to behave in an opposite way from one another e.g. in Figure 7 Arg most frequently mutates to Gln in the 1000 Genomes and a variation to Gly is much less common, whilst Arg to Gly is the most common variant in the OMIM dataset and a variation to Gln is rare.

We observe a reasonable correlation between the OMIM and Humsavar mutabilities (r = 0.51), but some amino acids appear to behave completely differently in the two datasets. Gly and Ala are much more frequently mutated in the Humsavar set than in OMIM, whilst Gln, Lys and His have mutabilities in the Humsavar set similar to those observed in the 1 kG dataset and much smaller than those in OMIM. This may reflect the larger Humsavar dataset (but this seems unlikely since Gly and Ala are quite common amino acids), so these specific discrepancies may rather reflect the origins of mutations in the two separate datasets.

Structural properties of disease-associated nsSNPs

The disease-associated OMIM variants show a slight preference for buried sites (33%) compared to all residues (27%) in the human proteome (Figure 5A), a preference that is even stronger in the Humsavar data (41%). This contrasts with the ‘natural’ variants of the 1 kG data,

Figure 11.
Amino acid mutability rank order plot comparing the mutability scores for 1 kG, OMIM and Humsavar residues. The most mutable amino acids are at the top. Correlation coefficients for 1 kG vs OMIM, 1 kG vs Humsavar and OMIM vs Humsavar are 0.09, 0.17 and 0.51, respectively. doi:10.1371/journal.pcbi.1003382.g011

which show a decreased preference (18%) for the interior. Our work broadly agrees with a smaller study done by Gong and Blundell [21] that showed 60–65% of disease associated nsSNPs are solvent exposed. We found an almost identical distribution of OMIM and Humsavar variants compared to all residues and the 1 kG variants between the different secondary structures (Figure 5B).

Figure 8A shows the differences in the DOPE scores [42] calculated for each variant during the structural modelling process for the 1 kG, OMIM and Humsavar datasets. The distribution for the disease-associated variants is shifted towards larger positive energies in both datasets, indicating that the variants destabilize the protein slightly more than the non-disease variants. In contrast to the 1 kG data, OMIM mutations are more likely to increase the polarity (54%) and more likely to decrease size (51.6%, Figure 8B,C). The two datasets show some detailed differences in size and hydrophobicity changes. The Humsavar variants less frequently reduce size or decrease hydrophobicity compared to OMIM mutations.

Functional annotations

In the OMIM set, 11.2% (209 of 1,864) of the modelled mutations were annotated with a function (Figure 5C and methods). This is less than the distribution for all residues (29.1%) and that seen for the 1 kG variants (15.5%). For the Humsavar data this drops to only 6.5%. This is a surprising finding, which needs further validation. It implies that most disease-associated mutations do not have a direct effect on the proteins’ catalytic or binding sites but instead act through other, unannotated residues such as those which affect overall structure and stability or are involved in as yet unidentified protein-protein interfaces.

Conservation

There is a clear difference in the conservation score distribution between natural variants and the OMIM and Humsavar variants (Figure 6). The natural variants occur across the entire range of conservation but the OMIM and Humsavar variants show a peak in the more conserved residues. This is consistent with the idea that mutations in conserved residues often lead to disease.

Discussion

The results presented herein are subject to a few caveats, the most serious being related to the limited and possibly biased disease-associated data in OMIM. There are only ~10,000 variants in our OMIM set and these have variable experimental validation, and may indeed be biased according to scientists’ preconceptions that such mutations should correspond to the residues that are most conserved and the amino acid exchanges that generate the largest changes in physicochemical characteristics. The Humsavar set has over 23,000 disease variants; however, the requirements for inclusion are based on an annotation of ‘involvement in disease’. This annotation is derived from either OMIM annotations or associations found in literature during curation of the SwissProt data. Notwithstanding, the OMIM dataset is one of the best available at the present time, although the coming years will see major expansion and hopefully improvements in such data. The results highlight the complex interplay of features from the level of the DNA up to protein sequence and structure. The codon CpG dinucleotide content plays a large role in determining which amino acids mutate. This in turn affects the mutability of amino acids and a clear difference was seen between non-disease and disease variants, where amino acids that are naturally very mutable show the opposite trend in the disease-associated data.

The data for the 1000 Genomes provides a new experimental baseline against which amino acid profiles may be compared. Although there might be sequencing biases due to the DNA sequencing technologies used [43], every effort has been made by the 1000 Genomes consortium to correct for this. They estimate that using consensus calling on data produced by multiple platforms results in an error rate of 1–4%, thus having a small and negligible impact on our results. The current results show evidence for some protein selection, mainly in that the variants occur slightly more often on the surface of the protein and are much less likely to be annotated as functional than expected by chance. However, we should note that even the best definition of functional, taken from structural data, is limited. At one level, the definition is rather broad. For example, all residues in contact with a ligand are described as functional, but this is a major underestimate since many cognate ligands are not present in the crystal structures and similarly protein-protein interactions are rarely captured. In addition there are still relatively few complete structures for human proteins, which makes analysis of the effects of variants more difficult.

Even with these caveats, it is clear that the 1 kG variants eschew functional residues as defined here, a trend which is surprisingly even stronger in the OMIM and Humsavar data. The preference for OMIM mutations to be more buried and less functional supports the suggestion that these variants predominantly affect the structure and stability of the protein [4]. This is a similar result to that found by Sunyaev and co-workers [44] on a much smaller set. They found that 35% of disease variants were buried and a more detailed analysis found that ~70% of the variants are located in structurally and functionally important regions. Therefore these disease-associated mutations may well target residues that are remote from the active site, which modulate rather than obliterate the function of the protein. For example, for an enzyme, the primary catalytic residues are rarely targeted, but the ‘secondary’ residues in the interior (which affect stability) or on the surface, which may affect protein-protein interactions, could modulate function. However, the higher than average conservation scores for OMIM and Humsavar sites suggest that these disease-associated residues, although not defined as ‘functional’, are still important for the organism. This needs further investigation, with particular attention to how ‘functional’ residues are defined and whether we can improve on this definition.

Bringing together all the above observations for disease-associated and natural variants in 1000 humans, we observe that the mutability of amino acids is largely driven by the properties of the DNA and mutational mechanisms, which favour mutations at codons containing a CpG dinucleotide. Therefore mutations to Arg residues are more than twice as common as any other mutation. However there are clearly other factors at play, which determine the frequency of variants, even at the DNA level. Although the disease-associated variants (both OMIM and Humsavar) follow the same pattern as the 1 kG variants (i.e. the same mutations are present in both sets, as dictated by the genetic code), the rank order of amino acids, according to their probability of being disease-associated, is radically different from that expected on the basis of the 1 kG data, with some of the rarer amino acids being shifted to the top of the list.

There is a small but significant impact of the protein structure on amino acid mutability, so that natural variants occur slightly more often in non-conserved regions. 59.4% of variations increase the hydrophobicity of the amino acid and 52.4% increase its size in the natural set, while OMIM variants often result in larger changes in the size and hydrophobicity of the amino acid and are more destabilising on average than 1 kG variants. The Humsavar data supports this idea that disease variants result in more extreme changes. The selection pressures captured in the WAG and PAM matrices ‘purify’ out the ‘natural’ variants, removing variants with large changes in size and hydrophobicity. The amino acids all show distinctive exchange profiles, whereby some exchanges are very common and some very rare, which provides an empirical expectation for any specific exchange in humans.

As the cost of sequencing drops rapidly, many more genomes will be sequenced and experimental validation of disease-causing mutations will improve as a result of more data. Much better codon-based models of evolution will be attainable, allowing in turn a better dissection of the impact of selection at the protein level. The data herein will be used to develop an improved method to predict the effects of individual mutations, to explore cancer-related amino acid mutations, to investigate and compare mutational profiles in different organisms as well as improving codon mutation models for human DNA.

Methods

Non-synonymous mutations in humans

UniProt [5] was queried for all reviewed protein sequences belonging to Homo sapiens. 19,058 entries were retrieved. The Ensembl transcript ID [45] was obtained for each protein sequence using the mapping provided by UniProt (17,708 UniProt entries were mapped to 40,351 Ensembl transcript IDs). Immunoglobulins and major histocompatibility complex proteins were excluded as they are inherently variable. For every protein, the Ensembl v67 Perl API was used to query the transcript ID in Ensembl for nsSNPs found in the 1 kG data set (as available on 1 August 2012). To reduce the inherent uncertainty involved in determining the ancestral allele, only mutations that occurred in one of the 1000 Genomes described populations were used, with the allele present in all populations considered the ancestral, hence defining the direction of the mutation. This increases the chances that the variant found in the 1 kG data is a mutation away from the ancestral genome. 106,311 mutations were found and this data set, containing the ‘natural’ variants found in the 1 kG project, will be referred to as the 1 kG set.

Residue conservation scores for each residue in every protein sequence were calculated using the Evolutionary Trace server [35]. Conservation scores for 2,274 sequences could not be calculated due to the methodology used by the Evolutionary Trace server, which disregards residues in columns of the multiple alignment containing more than 60% gaps and ranked as being non-conserved, as well as residues judged by the algorithm not to have enough information. This process almost certainly preferentially excludes surface residues (where insertions and deletions are most common) but since we are using the conservation distribution for comparisons, this bias is not significant. The UniProt sequences were used to calculate the relative abundance of amino acids in human proteins. A total of about 10.5 million amino acids were counted. For each protein sequence, the OMIM Mutations search tool (http://www.bioinf.org.uk/omim) was queried with the UniProt entry ID to retrieve variants found in OMIM. Only variants for which the correct amino acid position in the protein
has been verified were used for the OMIM data set, which will be referred to as the OMIM set. 556 of the OMIM mutations were found in the 1 kG set (0.5%). Although these represent a very small fraction we removed them so that they did not bias the results.

The instantaneous rate change matrices were derived using the DCFreq method [36] and the human proteome frequencies.

Mutability of amino acids

A mutability score for every amino acid was calculated by taking the total number of mutations for a specific amino acid in the data and dividing by the frequency of occurrence for the specific amino acid in the human genome. The proportional representation of each amino acid in the human proteome is given in supplemental Table S1.

Statistical validation

We compared the amino acid variant counts in the 1 kG and OMIM data using Fisher’s exact test in the R package (R Development Core Team, 2011). Multiple comparison correction was done on the p-values for each amino acid using p.adjust in R with the Benjamini-Hochberg-Yekutieli method [46,47]. P-values lower than 0.01 were considered statistically significant. For correlation values, r > 0.7 and r < -0.7 were considered strong, 0.4 < r < 0.7 and -0.4 > r > -0.7 were considered moderate and 0.3 > r > -0.3 weak or no correlation.

Retrieving human proteins and their structures

The protein structure data set was constructed by first taking all the above mentioned protein sequences and annotating each with their respective Pfam [48] domains. Only proteins for which there were matching entries in the Protein Data Bank (PDB, [49]) were kept. This resulted in a list containing the UniProt identifiers for all known human proteins that have at least one structure in the PDB. For accuracy, the corresponding PDB structures were then filtered to include only X-ray structures. Using the Pfam mapping, only protein structures containing all the protein’s Pfam domains were kept. The final list contained 2,139 protein chains and will be referred to as the 3D set.

A set consisting only of human monomeric proteins was also constructed. An algorithm was implemented whereby a protein was classified as being either a multimer or a monomer based on a majority vote. The predictions used were from PISA [50], UniProt, 3DComplex [51], PIQSI [52], PQS-PITA [53–55], relevant PubMed abstracts and REMARK 350 records from the PDB structure file. The oligomeric predictions from each of the servers were collected for every protein in the 3D set. Only when the majority of the servers agreed on the most probable oligomeric state of the protein was it designated as either a multimer or a monomer. The monomeric protein list contained 325 proteins and will be referred to as the monomer set.

Another homology-based set was constructed using the human models in ModBase [31]. Models with 90–100% sequence identity and coverage were used as templates. This set contained 2,630 models and will be referred to as the model set.

Protein chain annotation

Each protein chain in the 3D, monomer and model sets was annotated with information from various databases and online resources. Information about protein properties such as catalytic residues, metal-binding residues, ligand-binding residues and PROSITE patterns [56] was extracted from PDBsum [34] and additional functional residue annotations were retrieved using SAS (Sequence Annotated by Structure, [33]). The 3D coordinates for each of the proteins in the structure data sets were retrieved from the PDB. To maintain consistency between the PDB and UniProt residue numbering, the SIFTS mapping [57] for each protein chain was used. NACCESS was used to calculate the relative solvent accessibilities for the individual residues in a chain. A cut-off of 5% solvent exposure was used to distinguish between buried and exposed residues.

Mapping nsSNPs to structures

To investigate the effect a nsSNP might have, each individual nsSNP was mapped to its correct amino acid in the protein structure. For every such nsSNP that could be mapped, a homology model of the protein containing the nsSNP was built using Modeller 9v3 [32] with the original protein structure serving as the template. A maximum of 200 steps of conjugate gradient minimization followed by 200 rounds of molecular dynamics at 300 K (using Modeller) was applied to each variant and its structural context analysed. NACCESS was run on all the variant models to identify changes in solvent accessibility. Comparisons of the Modeller DOPE score (Discrete Optimized Protein Energy, [42]) were made between the nsSNP model and the reference structure to estimate the magnitude of change that a variant might cause. The 1 kG models are available in PDBsum (http://www.ebi.ac.uk/pdbsum/) by looking at the specific PDB code of interest.

Supporting Information

Figure S1 Mutabilities of the amino acids for each population. AMR: American admixed, ASN: South East Asian, AFR: African, EUR: European. (EPS)

Figure S2 The distribution of average protein mutabilities for all human proteins (blue) and disease associated proteins (red). (EPS)

Figure S3 The amino acid exchanges observed in human protein variants. The 1 kG data set is the top row of each cell and Humsavar (SP) the bottom row of each cell*. Amino acids are arranged by 1 letter code according to increasing hydrophobicity (least hydrophobic is left and most hydrophobic is right) using the Fauchère and Pliska scale. Yellow blocks indicate mutations where there are statistically significant differences between 1 kG and Humsavar. Blue blocks indicate where no mutations were present in the 1 kG data set. White blocks show where there are no statistically significant differences. Green blocks show where there are proportionally more 1 kG mutations compared to Humsavar. Orange blocks show where there are proportionally more Humsavar mutations than 1 kG. The mutability scores (see methods) for the 1 kG and Humsavar sets are shown in the last column. *Note that these matrices are fundamentally different. The 1 kG data set gathers all the observed mutations in the 1 kG project, counting each only once; the Humsavar data set combines information gathered from potentially many individuals but filtered to identify those mutations associated with a disease. (EPS)

Figure S4 Comparison of the differences in observed mutations in the various sets. Comparison of the differences in the % of observed mutations in the 1 kG (blue) and Humsavar (red) sets for one amino acid mutating to all others, e.g. proportionally, more mutations from Lys to Glu are recorded in Humsavar than in the 1 kG set. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid. (EPS)

Figure S5 Comparison of the differences in observed mutations in the various sets. Comparison of the differences in the % of observed mutations in the Humsavar (green) and OMIM (red) sets for one amino acid mutating to all others. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid. (EPS)

Acknowledgments

We would like to thank Angela Wilkins for running the large scale conservation analysis, Grecia Lapizco-Encinas for constructing the monomer set, Arjun Ray for doing the analysis of the ModBase models and Ewan Birney for valuable discussions.

Author Contributions

Conceived and designed the experiments: TAPdB RAL SLP BS NG JMT. Performed the experiments: TAPdB RAL. Analyzed the data: TAPdB SLP BS. Wrote the paper: TAPdB RAL JMT.
Valuable discussion regarding the method design: NG JMT SLP BS.

Table S1 The relative abundances of the various amino acids in the UniProt protein set. (PDF)

References

1. 1000 Genomes Project Consortium, Durbin RM, Abecasis GR, Altshuler DL, Auton A, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
2. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
3. Iengar P (2012) An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Res 40: 6401–6413.
4. Amberger J, Bocchini CA, Scott AF, Hamosh A (2009) McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res 37: D793–D796.
5. UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–D148.
6. Stenson PD, Ball E, Howells K, Phillips A, Mort M, et al. (2008) Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45: 124–126.
7. Ng PC, Henikoff S (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.
8. Steward RE, MacArthur MW, Laskowski RA, Thornton JM (2003) Molecular basis of inherited diseases: a structural perspective. Trends Genet 19: 505–513.
9. Fabre KM, Ramaiah L, Dregalla RC, Desaintes C, Weil MM, et al. (2011) Murine Prkdc polymorphisms impact DNA-PKcs function. Radiat Res 175: 493–500.
10. Minutolo C, Nadra AD, Fernández C, Taboas M, Buzzalino N, et al. (2011) Structure-based analysis of five novel disease-causing mutations in 21-hydroxylase-deficient patients. PLoS One 6: e15899.
11. González-Pérez A, López-Bigas N (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88: 440–449.
12. Bromberg Y, Yachdav G, Rost B (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.
13. Worth CL, Preissner R, Blundell TL (2011) Sdm–a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res 39: W215–W222.
14. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, et al. (2010) A method and server for predicting damaging missense mutations. Nat Methods 7: 248–249.
15. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, et al. (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070.
16. Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.
17. Ng PC, Henikoff S (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.
18. Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30: 1237–1244.
19. Nakken S, Alseth I, Rognes T (2007) Computational prediction of the effects of non-synonymous single nucleotide polymorphisms in human DNA repair genes. Neuroscience 145: 1273–1279.
20. Reumers J, Schymkowitz J, Rousseau F (2009) Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations. BMC Bioinformatics 10 Suppl 8: S9.
21. Gong S, Blundell TL (2010) Structural and functional restraints on the occurrence of single amino acid variations in human proteins. PLoS One 5: e9186.
22. Kamaraj B, Purohit R (2013) Computational screening of disease-associated mutations in OCA2 gene. Cell Biochem Biophys: 1–13.
23. Gong S, Worth CL, Bickerton GRJ, Lee S, Tanramluk D, et al. (2009) Structural and functional restraints in the evolution of protein families and superfamilies. Biochem Soc Trans 37: 727–733.
24. Worth CL, Gong S, Blundell TL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10: 709–720.
25. Subramanian S, Kumar S (2006) Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306.
26. Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, et al. (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.
27. Hurst LD, Feil EJ, Rocha EPC (2006) Protein evolution: causes of trends in amino-acid gain and loss. Nature 442: E11–2; discussion E12.
28. Walser JC, Furano AV (2010) The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res 20: 875–882.
29. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, et al. (2012) Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488: 471–475.
30. Zuckerkandl E, Derancourt J, Vogel H (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59: 473–
31. Pieper U, Webb BM, Barkan DT, Schneidman-Duhovny D, Schlessinger A, et al. (2011) ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39: D465–D474.
32. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779–815.
33. Milburn D, Laskowski RA, Thornton JM (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 11: 855–859.
34. Laskowski RA (2009) PDBsum new things. Nucleic Acids Res 37: D355–D359.
35. Mihalek I, Res I, Lichtarge O (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.
36. Kosiol C, Goldman N (2005) Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193–199.
37. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18: 691–699.
38. Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(3): 345–351.
39. Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, et al. (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.
40. Akashi H, Osada N, Ohta T (2012) Weak selection and protein evolution. Genetics 192: 15–31.
41. Vitkup D, Sander C, Church GM (2003) The amino-acid mutational spectrum of human genetic disease. Genome Biol 4: R72.
42. Shen MY, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15: 2507–2524.
43. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, et al. (2013) Characterizing and measuring bias in sequence data. Genome Biol 14: R51.
44. Sunyaev S, Ramensky V, Bork P (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200.
45. Flicek P, Aken BL, Ballester B, Beal K, Bragin E, et al. (2010) Ensembl’s 10th year. Nucleic Acids Res 38: D557–D562.
46. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Statistical Society 57(1): 289–300.
47. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4): 1165–1188.
48. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–D222.
49. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
50. Krissinel E, Henrick K (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372: 774–797.
51. Levy ED, Pereira-Leal JB, Chothia C, Teichmann SA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2: e155.
52. Levy ED (2007) PiQSi: protein quaternary structure investigation. Structure 15: 1364–1367.
53. Ponstingl H, Henrick K, Thornton JM (2000) Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins 41: 47–57.
54. Henrick K, Thornton JM (1998) PQS: a protein quaternary structure file server. Trends Biochem Sci 23: 358–361.
55. Ponstingl H, Kabir T, Gorse D, Thornton JM (2005) Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol 89: 9–35.
56. Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, et al. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3: 265–274.
57. Velankar S, McNeil P, Mittard-Runte V, Suarez A, Barrell D, et al. (2005) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res 33: D262–D265.
58. Fauchère JL, Charton M, Kier LB, Verloop A, Pliska V (1988) Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 32: 269–278.
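As a companion to the "Mutability of amino acids" and "Statistical validation" subsections of Methods, the following is a minimal Python sketch of three of the calculations described there: the mutability score, the verbal labels attached to correlation coefficients, and a false discovery rate adjustment of p-values. It is not the authors' code (the analysis in the paper was run in R). All function names and the toy numbers are hypothetical. Two points are assumptions rather than statements from the text: the Methods leave values of |r| between 0.3 and 0.4 (and the exact boundary values) unlabelled, so the sketch returns "unclassified" there; and "Benjamini-Hochberg-Yekutieli" could denote either the Benjamini-Hochberg or the Benjamini-Yekutieli procedure, so both are offered via a flag.

```python
def mutability_scores(mutation_counts, residue_frequency):
    """Number of observed mutations of each residue type divided by that
    residue type's frequency of occurrence."""
    return {aa: mutation_counts[aa] / residue_frequency[aa]
            for aa in mutation_counts}


def correlation_strength(r):
    """Label r using the thresholds quoted in the Methods.

    Assumption: the text does not label 0.3 <= |r| <= 0.4, so those
    values are reported as "unclassified"; |r| of exactly 0.7 is
    treated as moderate.
    """
    magnitude = abs(r)
    if magnitude > 0.7:
        return "strong"
    if magnitude > 0.4:
        return "moderate"
    if magnitude < 0.3:
        return "weak or none"
    return "unclassified"


def adjust_p_values(p_values, dependency=True):
    """Step-up false discovery rate adjustment of a list of p-values.

    dependency=True applies the Benjamini-Yekutieli harmonic-sum penalty;
    dependency=False gives the plain Benjamini-Hochberg adjustment.
    Adjusted values are returned in the input order, capped at 1.
    """
    m = len(p_values)
    penalty = sum(1.0 / k for k in range(1, m + 1)) if dependency else 1.0
    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so the adjusted values stay monotone.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[index] * m * penalty / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy, made-up inputs purely for illustration (not data from the paper).
    toy_counts = {"Arg": 200, "Trp": 10}
    toy_frequency = {"Arg": 0.0625, "Trp": 0.015625}
    print(mutability_scores(toy_counts, toy_frequency))
    print(correlation_strength(-0.67))
    significant = [p < 0.01 for p in adjust_p_values([0.001, 0.02, 0.03])]
    print(significant)
```

The 0.01 cut-off in the usage lines mirrors the significance threshold stated in the Methods; whether it was applied to raw or adjusted p-values for every comparison is not spelled out, so applying it to adjusted values here is also an assumption.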


References (122)

( HurstLD, FeilEJ, RochaEPC (2006) Protein evolution: causes of trends in amino-acid gain and loss. Nature 442: E11–2 discussion E12.16929253)
HurstLD, FeilEJ, RochaEPC (2006) Protein evolution: causes of trends in amino-acid gain and loss. Nature 442: E11–2 discussion E12.16929253
HurstLD, FeilEJ, RochaEPC (2006) Protein evolution: causes of trends in amino-acid gain and loss. Nature 442: E11–2 discussion E12.16929253, HurstLD, FeilEJ, RochaEPC (2006) Protein evolution: causes of trends in amino-acid gain and loss. Nature 442: E11–2 discussion E12.16929253
( FlicekP, AkenBL, BallesterB, BealK, BraginE, et al (2010) Ensembl's 10th year. Nucleic Acids Res 38: D557–D562.19906699)
FlicekP, AkenBL, BallesterB, BealK, BraginE, et al (2010) Ensembl's 10th year. Nucleic Acids Res 38: D557–D562.19906699
FlicekP, AkenBL, BallesterB, BealK, BraginE, et al (2010) Ensembl's 10th year. Nucleic Acids Res 38: D557–D562.19906699, FlicekP, AkenBL, BallesterB, BealK, BraginE, et al (2010) Ensembl's 10th year. Nucleic Acids Res 38: D557–D562.19906699
S. Sunyaev, V. Ramensky, P. Bork (2000)
Towards a structural basis of human non-synonymous single nucleotide polymorphisms.
Trends in genetics : TIG, 16 5
C. Worth, Sungsam Gong, T. Blundell (2009)
Structural and functional constraints in the evolution of protein families
Nature Reviews Molecular Cell Biology, 10
H. Ponstingl, T. Kabir, D. Gorse, J. Thornton (2005)
Morphological aspects of oligomeric protein structures.
Progress in biophysics and molecular biology, 89 1
W McLaren (2010)
Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor
Bioinformatics, 26
HM Berman (2000)
The Protein Data Bank
Nucleic Acids Res, 28
( UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–D148.19843607)
UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–D148.19843607
UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–D148.19843607, UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–D148.19843607
P. Ng, S. Henikoff (2006)
Predicting the effects of amino acid substitutions on protein function.
Annual review of genomics and human genetics, 7
( FinnRD, MistryJ, TateJ, CoggillP, HegerA, et al (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–D222.19920124)
FinnRD, MistryJ, TateJ, CoggillP, HegerA, et al (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–D222.19920124
FinnRD, MistryJ, TateJ, CoggillP, HegerA, et al (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–D222.19920124, FinnRD, MistryJ, TateJ, CoggillP, HegerA, et al (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–D222.19920124
( LevyED (2007) PiQSi: protein quaternary structure investigation. Structure 15: 1364–1367.17997962)
LevyED (2007) PiQSi: protein quaternary structure investigation. Structure 15: 1364–1367.17997962
LevyED (2007) PiQSi: protein quaternary structure investigation. Structure 15: 1364–1367.17997962, LevyED (2007) PiQSi: protein quaternary structure investigation. Structure 15: 1364–1367.17997962
P. Stenson, E. Ball, K. Howells, A. Phillips, M. Mort, D. Cooper (2007)
Human Gene Mutation Database: towards a comprehensive central mutation database
Journal of Medical Genetics, 45
( MihalekI, ResI, LichtargeO (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.15037084)
MihalekI, ResI, LichtargeO (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.15037084
MihalekI, ResI, LichtargeO (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.15037084, MihalekI, ResI, LichtargeO (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.15037084
( WorthCL, GongS, BlundellTL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10: 709–720.19756040)
WorthCL, GongS, BlundellTL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10: 709–720.19756040
WorthCL, GongS, BlundellTL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10: 709–720.19756040, WorthCL, GongS, BlundellTL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10: 709–720.19756040
( KosiolC, GoldmanN (2005) Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193–199.15483331)
KosiolC, GoldmanN (2005) Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193–199.15483331
KosiolC, GoldmanN (2005) Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193–199.15483331, KosiolC, GoldmanN (2005) Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193–199.15483331
H. Ponstingl, K. Henrick, J. Thornton (2000)
Discriminating between homodimeric and monomeric proteins in the crystalline state
Proteins: Structure, 41
B. Kamaraj, R. Purohit (2013)
Computational Screening of Disease-Associated Mutations in OCA2 Gene
Cell Biochemistry and Biophysics, 68
( KrissinelE, HenrickK (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372: 774–797.17681537)
KrissinelE, HenrickK (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372: 774–797.17681537
KrissinelE, HenrickK (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372: 774–797.17681537, KrissinelE, HenrickK (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372: 774–797.17681537
( LaskowskiRA (2009) PDBsum new things. Nucleic Acids Res 37: D355–D359.18996896)
LaskowskiRA (2009) PDBsum new things. Nucleic Acids Res 37: D355–D359.18996896
LaskowskiRA (2009) PDBsum new things. Nucleic Acids Res 37: D355–D359.18996896, LaskowskiRA (2009) PDBsum new things. Nucleic Acids Res 37: D355–D359.18996896
I. Adzhubei, S. Schmidt, L. Peshkin, V. Ramensky, A. Gerasimova, P. Bork, A. Kondrashov, S. Sunyaev (2010)
A method and server for predicting damaging missense mutations
Nature methods, 7
( BrombergY, YachdavG, RostB (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.18757876)
BrombergY, YachdavG, RostB (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.18757876
BrombergY, YachdavG, RostB (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.18757876, BrombergY, YachdavG, RostB (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.18757876
( FauchèreJL, ChartonM, KierLB, VerloopA, PliskaV (1988) Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 32: 269–278.3209351)
FauchèreJL, ChartonM, KierLB, VerloopA, PliskaV (1988) Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 32: 269–278.3209351
FauchèreJL, ChartonM, KierLB, VerloopA, PliskaV (1988) Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 32: 269–278.3209351, FauchèreJL, ChartonM, KierLB, VerloopA, PliskaV (1988) Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 32: 269–278.3209351
( SubramanianS, KumarS (2006) Evolutionary anatomies of positions and types of diseaseassociated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306.17144929)
SubramanianS, KumarS (2006) Evolutionary anatomies of positions and types of diseaseassociated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306.17144929
SubramanianS, KumarS (2006) Evolutionary anatomies of positions and types of diseaseassociated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306.17144929, SubramanianS, KumarS (2006) Evolutionary anatomies of positions and types of diseaseassociated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306.17144929
( CalabreseR, CapriottiE, FariselliP, MartelliPL, CasadioR (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30: 1237–1244.19514061)
CalabreseR, CapriottiE, FariselliP, MartelliPL, CasadioR (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30: 1237–1244.19514061
CalabreseR, CapriottiE, FariselliP, MartelliPL, CasadioR (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30: 1237–1244.19514061, CalabreseR, CapriottiE, FariselliP, MartelliPL, CasadioR (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30: 1237–1244.19514061
Robert Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne Pollington, O. Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik Sonnhammer, Sean Eddy, Alex Bateman (2007)
The Pfam protein families database
Nucleic Acids Research, 38
H. Berman, Tammy Battistuz, T. Bhat, Wolfgang Bluhm, Philip Bourne, K. Burkhardt, Zukang Feng, G. Gilliland, L. Iype, Shri Jain, Phoebe Fagan, Jessica Marvin, David Padilla, V. Ravichandran, B. Schneider, N. Thanki, H. Weissig, J. Westbrook, C. Zardecki
Electronic Reprint Biological Crystallography the Protein Data Bank Biological Crystallography the Protein Data Bank
J. Reumers, J. Schymkowitz, F. Rousseau (2009)
Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations
BMC Bioinformatics, 10
( 1000 Genomes Project Consortium (2010) DurbinRM, AbecasisGR, AltshulerDL, AutonA, et al (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.20981092)
1000 Genomes Project Consortium (2010) DurbinRM, AbecasisGR, AltshulerDL, AutonA, et al (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.20981092
1000 Genomes Project Consortium (2010) DurbinRM, AbecasisGR, AltshulerDL, AutonA, et al (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.20981092, 1000 Genomes Project Consortium (2010) DurbinRM, AbecasisGR, AltshulerDL, AutonA, et al (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.20981092
( LevyED, Pereira-LealJB, ChothiaC, TeichmannSA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2: e155.17112313)
LevyED, Pereira-LealJB, ChothiaC, TeichmannSA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2: e155.17112313
LevyED, Pereira-LealJB, ChothiaC, TeichmannSA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2: e155.17112313, LevyED, Pereira-LealJB, ChothiaC, TeichmannSA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2: e155.17112313
P. Ng, S. Henikoff (2003)
SIFT: predicting amino acid changes that affect protein function
Nucleic acids research, 31 13
( NgPC, HenikoffS (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.12824425)
NgPC, HenikoffS (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.12824425
NgPC, HenikoffS (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.12824425, NgPC, HenikoffS (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.12824425
( LohmuellerKE, IndapAR, SchmidtS, BoykoAR, HernandezRD, et al (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.18288194)
LohmuellerKE, IndapAR, SchmidtS, BoykoAR, HernandezRD, et al (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.18288194
LohmuellerKE, IndapAR, SchmidtS, BoykoAR, HernandezRD, et al (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.18288194, LohmuellerKE, IndapAR, SchmidtS, BoykoAR, HernandezRD, et al (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.18288194
I. Jordan, F. Kondrashov, I. Adzhubei, Y. Wolf, E. Koonin, A. Kondrashov, S. Sunyaev (2005)
A universal trend of amino acid gain and loss in protein evolution
Nature, 433
P. Flicek, Bronwen Aken, B. Ballester, Kathryn Beal, E. Bragin, Simon Brent, Yuan Chen, P. Clapham, Guy Coates, S. Fairley, Stephen Fitzgerald, J. Fernandez-Banet, Leo Gordon, S. Gräf, Syed Haider, M. Hammond, K. Howe, Andrew Jenkinson, Nathan Johnson, Andreas Kähäri, Damian Keefe, S. Keenan, R. Kinsella, F. Kokocinski, Gautier Koscielny, Eugene Kulesha, D. Lawson, Ian Longden, Tim Massingham, W. McLaren, K. Megy, B. Overduin, Bethan Pritchard, Daniel Rios, Magali Ruffier, Michael Schuster, G. Slater, D. Smedley, Giulietta Spudich, Y. Tang, S. Trevanion, Albert Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, Fiona Cunningham, I. Dunham, R. Durbin, X. Fernández-Suárez, Javier Herrero, T. Hubbard, Anne Parker, G. Proctor, James Smith, S. Searle (2009)
Ensembl’s 10th year
Nucleic Acids Research, 38
The Consortium (2009)
The Universal Protein Resource (UniProt) in 2010
Nucleic Acids Research, 38
I. Mihalek, I. Res, O. Lichtarge (2004)
A family of evolution-entropy hybrid methods for ranking protein residues by importance.
Journal of molecular biology, 336 5
( González-PérezA, López-BigasN (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88: 440–449.21457909)
González-PérezA, López-BigasN (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88: 440–449.21457909
González-PérezA, López-BigasN (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88: 440–449.21457909, González-PérezA, López-BigasN (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88: 440–449.21457909
Sankar Subramanian, Sudhir Kumar (2006)
Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome
BMC Genomics, 7
C. Kosiol, N. Goldman (2005)
Different versions of the Dayhoff rate matrix.
Molecular biology and evolution, 22 2
S. Nakken, I. Alseth, Torbjørn Rognes (2007)
Computational prediction of the effects of non-synonymous single nucleotide polymorphisms in human DNA repair genes
Neuroscience, 145
( SunyaevS, RamenskyV, BorkP (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200.10782110)
SunyaevS, RamenskyV, BorkP (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200.10782110
SunyaevS, RamenskyV, BorkP (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200.10782110, SunyaevS, RamenskyV, BorkP (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200.10782110
( IengarP (2012) An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Res 40: 6401–6413.22492711)
IengarP (2012) An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Res 40: 6401–6413.22492711
IengarP (2012) An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Res 40: 6401–6413.22492711, IengarP (2012) An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Res 40: 6401–6413.22492711
( WalserJC, FuranoAV (2010) The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res 20: 875–882.20498119)
WalserJC, FuranoAV (2010) The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res 20: 875–882.20498119
WalserJC, FuranoAV (2010) The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res 20: 875–882.20498119, WalserJC, FuranoAV (2010) The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res 20: 875–882.20498119
A. Kong, M. Frigge, G. Másson, S. Besenbacher, S. Besenbacher, P. Sulem, Gísli Magnússon, S. Gudjonsson, A. Sigurdsson, Á. Jónasdóttir, A. Jonasdottir, Wendy Wong, G. Sigurdsson, G. Walters, S. Steinberg, H. Helgason, G. Thorleifsson, D. Gudbjartsson, Agnar Helgason, Agnar Helgason, O. Magnusson, U. Thorsteinsdóttir, U. Thorsteinsdóttir, K. Stefánsson, K. Stefánsson (2012)
Rate of de novo mutations and the importance of father’s age to disease risk
Nature, 488
J. Walser, A. Furano (2010)
The mutational spectrum of non-CpG DNA varies with CpG content.
Genome research, 20 7
E. Levy, J. Pereira-Leal, C. Chothia, S. Teichmann (2006)
3D Complex: A Structural Classification of Protein Complexes
PLoS Computational Biology, 2
( KongA, FriggeML, MassonG, BesenbacherS, SulemP, et al (2012) Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471–475.22914163)
KongA, FriggeML, MassonG, BesenbacherS, SulemP, et al (2012) Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471–475.22914163
KongA, FriggeML, MassonG, BesenbacherS, SulemP, et al (2012) Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471–475.22914163, KongA, FriggeML, MassonG, BesenbacherS, SulemP, et al (2012) Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471–475.22914163
P. Ng, S. Henikoff (2001)
Predicting deleterious amino acid substitutions.
Genome research, 11 5
A. Sali, T. Blundell (1993)
Comparative protein modelling by satisfaction of spatial restraints.
Journal of molecular biology, 234 3
( NgPC, HenikoffS (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.16824020)
NgPC, HenikoffS (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.16824020
NgPC, HenikoffS (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.16824020, NgPC, HenikoffS (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.16824020
Kristin Fabre, L. Ramaiah, R. Dregalla, C. Desaintes, M. Weil, S. Bailey, R. Ullrich (2011)
Murine Prkdc Polymorphisms Impact DNA-PKcs Function
, 175
A. Golovin, T. Oldfield, J. Tate, S. Velankar, G. Barton, H. Boutselakis, D. Dimitropoulos, J. Fillon, A. Hussain, J. Ionides, M. John, P. Keller, E. Krissinel, P. McNeil, A. Naim, R. Newman, A. Pajon, Jorge Pineda-Castillo, A. Rachedi, J. Copeland, A. Sitnov, S. Sobhany, A. Suarez-Uruena, Jawahar Swaminathan, M. Tagari, S. Tromm, W. Vranken, K. Henrick (2004)
E-MSD: an integrated data resource for bioinformatics
Nucleic Acids Research, 33
( ZuckerkandlE, DerancourtJ, VogelH (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59: 473–490.5571595)
ZuckerkandlE, DerancourtJ, VogelH (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59: 473–490.5571595
ZuckerkandlE, DerancourtJ, VogelH (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59: 473–490.5571595, ZuckerkandlE, DerancourtJ, VogelH (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59: 473–490.5571595
( PieperU, WebbBM, BarkanDT, Schneidman-DuhovnyD, SchlessingerA, et al (2011) ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39: D465–D474.21097780)
PieperU, WebbBM, BarkanDT, Schneidman-DuhovnyD, SchlessingerA, et al (2011) ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39: D465–D474.21097780
PieperU, WebbBM, BarkanDT, Schneidman-DuhovnyD, SchlessingerA, et al (2011) ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39: D465–D474.21097780, PieperU, WebbBM, BarkanDT, Schneidman-DuhovnyD, SchlessingerA, et al (2011) ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39: D465–D474.21097780
R. Steward, M. MacArthur, R. Laskowski, J. Thornton (2003)
Molecular basis of inherited diseases: a structural perspective.
Trends in genetics : TIG, 19 9
J. Amberger, C. Bocchini, A. Scott, A. Hamosh (2008)
McKusick's Online Mendelian Inheritance in Man (OMIM®)
Nucleic Acids Research, 37
Dennis Vitkup, C. Sander, G. Church (2003)
The amino-acid mutational spectrum of human genetic disease
Genome Biology, 4
Sungsam Gong, C. Worth, G. Bickerton, Semin Lee, D. Tanramluk, T. Blundell (2009)
Structural and functional restraints in the evolution of protein families and superfamilies.
Biochemical Society transactions, 37 Pt 4
( RossMG, RussC, CostelloM, HollingerA, LennonNJ, et al (2013) Characterizing and measuring bias in sequence data. Genome Biol 14: R51.23718773)
RossMG, RussC, CostelloM, HollingerA, LennonNJ, et al (2013) Characterizing and measuring bias in sequence data. Genome Biol 14: R51.23718773
RossMG, RussC, CostelloM, HollingerA, LennonNJ, et al (2013) Characterizing and measuring bias in sequence data. Genome Biol 14: R51.23718773, RossMG, RussC, CostelloM, HollingerA, LennonNJ, et al (2013) Characterizing and measuring bias in sequence data. Genome Biol 14: R51.23718773
( GongS, WorthCL, BickertonGRJ, LeeS, TanramlukD, et al (2009) Structural and functional restraints in the evolution of protein families and superfamilies. Biochem Soc Trans 37: 727–733.19614584)
GongS, WorthCL, BickertonGRJ, LeeS, TanramlukD, et al (2009) Structural and functional restraints in the evolution of protein families and superfamilies. Biochem Soc Trans 37: 727–733.19614584
GongS, WorthCL, BickertonGRJ, LeeS, TanramlukD, et al (2009) Structural and functional restraints in the evolution of protein families and superfamilies. Biochem Soc Trans 37: 727–733.19614584, GongS, WorthCL, BickertonGRJ, LeeS, TanramlukD, et al (2009) Structural and functional restraints in the evolution of protein families and superfamilies. Biochem Soc Trans 37: 727–733.19614584
Y. Benjamini, Y. Hochberg (1995)
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Journal of the royal statistical society series b-methodological, 57
( NgPC, HenikoffS (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.11337480)
NgPC, HenikoffS (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.11337480
NgPC, HenikoffS (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.11337480, NgPC, HenikoffS (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.11337480
( SaliA, BlundellTL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779–815.8254673)
SaliA, BlundellTL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779–815.8254673
SaliA, BlundellTL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779–815.8254673, SaliA, BlundellTL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779–815.8254673
( MinutoloC, NadraAD, FernándezC, TaboasM, BuzzalinoN, et al (2011) Structure-based analysis of five novel disease-causing mutations in 21-hydroxylase-deficient patients. PLoS One 6: e15899.21264314)
MinutoloC, NadraAD, FernándezC, TaboasM, BuzzalinoN, et al (2011) Structure-based analysis of five novel disease-causing mutations in 21-hydroxylase-deficient patients. PLoS One 6: e15899.21264314
MinutoloC, NadraAD, FernándezC, TaboasM, BuzzalinoN, et al (2011) Structure-based analysis of five novel disease-causing mutations in 21-hydroxylase-deficient patients. PLoS One 6: e15899.21264314, MinutoloC, NadraAD, FernándezC, TaboasM, BuzzalinoN, et al (2011) Structure-based analysis of five novel disease-causing mutations in 21-hydroxylase-deficient patients. PLoS One 6: e15899.21264314
( MilburnD, LaskowskiRA, ThorntonJM (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 11: 855–859.9862203)
MilburnD, LaskowskiRA, ThorntonJM (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 11: 855–859.9862203
MilburnD, LaskowskiRA, ThorntonJM (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 11: 855–859.9862203, MilburnD, LaskowskiRA, ThorntonJM (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 11: 855–859.9862203
( McLarenW, PritchardB, RiosD, ChenY, FlicekP, et al (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070.20562413)
McLarenW, PritchardB, RiosD, ChenY, FlicekP, et al (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070.20562413
McLarenW, PritchardB, RiosD, ChenY, FlicekP, et al (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070.20562413, McLarenW, PritchardB, RiosD, ChenY, FlicekP, et al (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070.20562413
( PonstinglH, KabirT, GorseD, ThorntonJM (2005) Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol 89: 9–35.15895504)
PonstinglH, KabirT, GorseD, ThorntonJM (2005) Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol 89: 9–35.15895504
PonstinglH, KabirT, GorseD, ThorntonJM (2005) Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol 89: 9–35.15895504, PonstinglH, KabirT, GorseD, ThorntonJM (2005) Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol 89: 9–35.15895504
( ShenMY, SaliA (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15: 2507–2524.17075131)
ShenMY, SaliA (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15: 2507–2524.17075131
ShenMY, SaliA (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15: 2507–2524.17075131, ShenMY, SaliA (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15: 2507–2524.17075131
Sungsam Gong, T. Blundell (2010)
Structural and Functional Restraints on the Occurrence of Single Amino Acid Variations in Human Proteins
PLoS ONE, 5
H. Akashi, N. Osada, T. Ohta (2012)
Weak Selection and Protein Evolution
Genetics, 192
E. Levy (2007)
PiQSi: protein quaternary structure investigation.
Structure, 15 11
L. Hurst, E. Feil, E. Rocha (2006)
Protein evolution: Causes of trends in amino-acid gain and loss
Nature, 442
R. Apweiler, M. Martin, C. O’Donovan, M. Magrane, Y. Alam-Faruque, R. Antunes, D. Barrell, B. Bely, M. Bingley, David Binns, Lawrence Bower, Paul Browne, W. Chan, E. Dimmer, R. Eberhardt, A. Fedotov, R. Foulger, J. Garavelli, R. Huntley, Julius Jacobsen, M. Kleen, K. Laiho, R. Leinonen, D. Legge, Quan Lin, W. Liu, J. Luo, S. Orchard, S. Patient, D. Poggioli, Manuela Pruess, M. Corbett, G. Martino, M. Donnelly, P. Rensburg, A. Bairoch, L. Bougueleret, I. Xenarios, S. Altairac, A. Auchincloss, Ghislaine Argoud-Puy, K. Axelsen, Delphine Baratin, M. Blatter, B. Boeckmann, Jerven Bolleman, L. Bollondi, E. Boutet, Sb Quintaje, L. Breuza, A. Bridge, E. Decastro, L. Ciapina, D. Coral, E. Coudert, Isabelle Cusin, G. Delbard, M. Doche, D. Dornevil, Pd Roggli, S. Duvaud, A. Estreicher, L. Famiglietti, M. Feuermann, S. Gehant, N. Farriol-Mathis, Serenella Ferro, E. Gasteiger, A. Gateau, Gerritsen, A. Gos, N. Gruaz-Gumowski, U. Hinz, C. Hulo, N. Hulo, J. James, S. Jimenez, F. Jungo, T. Kappler, G. Keller, Corinne Lachaize, L. Lane-Guermonprez, P. Langendijk-Genevaux, Lara, P. Lemercier, D. Lieberherr, Tdo Lima, Mangold, X. Martin, P. Masson, M. Moinat, A. Morgat, A. Mottaz, S. Paesano, I. Pedruzzi, S. Pilbout, Pillet, S. Poux, Monica Pozzato, Nicole Redaschi, C. Rivoire, B. Roechert, Michel Schneider, Christian Sigrist, K. Sonesson, S. Staehli, Eleanor Stanley, A. Stutz, S. Sundaram, M. Tognolli, L. Verbregue, A. Veuthey, L. Yip, L. Zuletta, Cathy Wu, C. Arighi, L. Arminski, W. Barker, Chuming Chen, Youhai Chen, Z-Z Hu, Hongzhan Huang, R. Mazumder, P. McGarvey, D. Natale, J. Nchoutmboube, N. Petrova, N. Subramanian, Baris Suzek, U. Ugochukwu, S. Vasudevan, C. Vinayaka, L. Yeh, J. Zhang (2010)
The Universal Protein Resource (UniProt) in 2010
E. Krissinel, K. Henrick (2007)
Inference of macromolecular assemblies from crystalline state.
Journal of molecular biology, 372 3
O. M. (1978)
22 A Model of Evolutionary Change in Proteins
Amino Acid Mutation Characteristics PLOS Computational Biology | www.ploscompbiol
( JordanIK, KondrashovFA, AdzhubeiIA, WolfYI, KooninEV, et al (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.15660107)
JordanIK, KondrashovFA, AdzhubeiIA, WolfYI, KooninEV, et al (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.15660107
JordanIK, KondrashovFA, AdzhubeiIA, WolfYI, KooninEV, et al (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.15660107, JordanIK, KondrashovFA, AdzhubeiIA, WolfYI, KooninEV, et al (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.15660107
Y. Benjamini, D. Yekutieli (2001)
THE CONTROL OF THE FALSE DISCOVERY RATE IN MULTIPLE TESTING UNDER DEPENDENCY
Annals of Statistics, 29
( StensonPD, BallE, HowellsK, PhillipsA, MortM, et al (2008) Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45: 124–126.18245393)
StensonPD, BallE, HowellsK, PhillipsA, MortM, et al (2008) Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45: 124–126.18245393
StensonPD, BallE, HowellsK, PhillipsA, MortM, et al (2008) Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45: 124–126.18245393, StensonPD, BallE, HowellsK, PhillipsA, MortM, et al (2008) Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45: 124–126.18245393
S. Whelan, N. Goldman (2001)
A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach.
Molecular biology and evolution, 18 5
Vitkup D, Sander C, Church GM (2003) The amino-acid mutational spectrum of human genetic disease. Genome Biol 4: R72. PMID: 14611658
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4): 1165–1188.
Amberger J, Bocchini CA, Scott AF, Hamosh A (2009) McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res 37: D793–D796. PMID: 18842627
Reumers J, Schymkowitz J, Rousseau F (2009) Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations. BMC Bioinformatics 10 Suppl 8: S9.
R. Calabrese, E. Capriotti, P. Fariselli, P. Martelli, R. Casadio (2009)
Functional annotations improve the predictive score of human disease‐related mutations in proteins
Human Mutation, 30
Y. Bromberg, Guy Yachdav, B. Rost (2008)
SNAP predicts effect of mutations on protein function
Bioinformatics, 24
W. McLaren, Bethan Pritchard, Daniel Rios, Yuan Chen, P. Flicek, Fiona Cunningham, Alfonso Valencia
Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor
P. Debye (1934)
The Crystalline State
Nature, 134
K. Lohmueller, Amit Indap, S. Schmidt, A. Boyko, Ryan Hernandez, M. Hubisz, J. Sninsky, T. White, S. Sunyaev, R. Nielsen, A. Clark, C. Bustamante (2008)
Proportionally more deleterious genetic variation in European than in African populations
Nature, 451
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Statistical Society 57(1): 289–300.
Akashi H, Osada N, Ohta T (2012) Weak selection and protein evolution. Genetics 192: 15–31. PMID: 22964835
Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, et al. (2010) A method and server for predicting damaging missense mutations. Nat Methods 7: 248–249. PMID: 20354512
Ponstingl H, Henrick K, Thornton JM (2000) Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins 41: 47–57. PMID: 10944393
M. Ross, C. Russ, Maura Costello, Andrew Hollinger, N. Lennon, Ryan Hegarty, C. Nusbaum, D. Jaffe (2013)
Characterizing and measuring bias in sequence data
Genome Biology, 14
Christian Sigrist, L. Cerutti, N. Hulo, Alexandre Gattiker, L. Falquet, M. Pagni, A. Bairoch, P. Bucher (2002)
PROSITE: A Documented Database Using Patterns and Profiles as Motif Descriptors
Briefings in bioinformatics, 3 3
A. Gonzalez-Perez, N. López-Bigas (2011)
Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel.
American journal of human genetics, 88 4
Henrick K, Thornton JM (1998) PQS: a protein quaternary structure file server. Trends Biochem Sci 23: 358–361. PMID: 9787643
U. Pieper, N. Eswar, F. Davis, Hannes Braberg, M. Madhusudhan, A. Rossi, M. Martí-Renom, R. Karchin, Ben Webb, David Eramian, Min-Yi Shen, L. Kelly, F. Melo, A. Sali (2005)
MODBASE: a database of annotated comparative protein structure models and associated resources
Nucleic Acids Research, 34
1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. PMID: 23128226
1000 Genomes Project Consortium (2012)
An integrated map of genetic variation from 1,092 human genomes
Nature, 491
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242. PMID: 10592235
Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, et al. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3: 265–274. PMID: 12230035
K. Henrick, J. Thornton (1998)
PQS: a protein quaternary structure file server.
Trends in biochemical sciences, 23 9
Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18: 691–699. PMID: 11319253
C. Worth, R. Preissner, T. Blundell (2011)
SDM—a server for predicting effects of mutations on protein stability and malfunction
Nucleic Acids Research, 39
Gong S, Blundell TL (2010) Structural and functional restraints on the occurrence of single amino acid variations in human proteins. PLoS One 5: e9186. PMID: 20169194
P. Iengar (2012)
An analysis of substitution, deletion and insertion mutations in cancer genes
Nucleic Acids Research, 40
C. Minutolo, A. Nadra, C. Fernández, Melisa Taboas, N. Buzzalino, B. Casali, S. Belli, E. Charreau, L. Alba, L. Dain (2011)
Structure-Based Analysis of Five Novel Disease-Causing Mutations in 21-Hydroxylase-Deficient Patients
PLoS ONE, 6
Fabre KM, Ramaiah L, Dregalla RC, Desaintes C, Weil MM, et al. (2011) Murine Prkdc polymorphisms impact DNA-PKcs function. Radiat Res 175: 493–500. PMID: 21265624
Kamaraj B, Purohit R (2013) Computational screening of disease-associated mutations in OCA2 gene. Cell Biochem Biophys 1–13.
D. Milburn, R. Laskowski, J. Thornton (1998)
Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis.
Protein engineering, 11 10
Velankar S, McNeil P, Mittard-Runte V, Suarez A, Barrell D, et al. (2005) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res 33: D262–D265. PMID: 15608192
Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(3): 345–351.
Min-Yi Shen, A. Sali (2006)
Statistical potential for assessment and prediction of protein structures
Protein Science, 15
E. Zuckerkandl, J. Derancourt, H. Vogel (1971)
Mutational trends and random processes in the evolution of informational macromolecules.
Journal of molecular biology, 59 3
Nakken S, Alseth I, Rognes T (2007) Computational prediction of the effects of non-synonymous single nucleotide polymorphisms in human DNA repair genes. Neuroscience 145: 1273–1279. PMID: 17055652
Worth CL, Preissner R, Blundell TL (2011) SDM–a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res 39: W215–W222. PMID: 21593128
R. Laskowski (2008)
PDBsum new things
Nucleic Acids Research, 37
Steward RE, MacArthur MW, Laskowski RA, Thornton JM (2003) Molecular basis of inherited diseases: a structural perspective. Trends Genet 19: 505–513. PMID: 12957544
G. Abecasis, D. Altshuler, A. Auton, L. Brooks, R. Durbin, R. Gibbs, M. Hurles, G. McVean (2010)
A map of human genome variation from population-scale sequencing
Nature, 467
J. Fauchère, M. Charton, L. Kier, A. Verloop, V. Pliska (2009)
Amino acid side chain parameters for correlation studies in biology and pharmacology.
International journal of peptide and protein research, 32 4
U Pieper (2011)
ModBase, a database of annotated comparative protein structure models, and associated resources
Nucleic Acids Res, 39

Publisher: Public Library of Science (PLoS) Journal
Copyright: © 2013 de Beer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported in part by the National Institutes of Health grant GM094585, by the U. S. Department of Energy, Office of Biological and Environmental Research, under contract DE-AC02-06CH11357 (Midwest Center for Structural Genomics) as well as EMBL-EBI. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
ISSN: 1553-734X
eISSN: 1553-7358
DOI: 10.1371/journal.pcbi.1003382

Abstract

The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability very much higher than other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.

Citation: de Beer TAP, Laskowski RA, Parks SL, Sipos B, Goldman N, et al. (2013) Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset. PLoS Comput Biol 9(12): e1003382. doi:10.1371/journal.pcbi.1003382
Editor: Yana Bromberg, Rutgers University, United States of America
Received April 29, 2013; Accepted October 22, 2013; Published December 12, 2013
* E-mail: [email protected]
PLOS Computational Biology | www.ploscompbiol.org | December 2013 | Volume 9 | Issue 12 | e1003382

Author Summary

In this paper we compare the differences between 'natural' and disease-associated amino acid variants at both sequence as well as structural levels. We used data from the 1000 Genomes Project (1 kG), the OMIM database and UniProtKB Humsavar. The results highlight the complex interplay of features from the level of the DNA up to protein sequence and structure. The codon CpG dinucleotide content plays a large role in determining which amino acids mutate. This in turn affects the mutability of amino acids and a clear difference was seen between non-disease and disease variants where amino acids that are naturally very mutable show the opposite trend in the disease-associated data. The current results show evidence for some selection, mainly in that the variants occur slightly more often on the surface of the protein and are much less likely to be annotated as functional than expected by chance. However we should note that even the best definition of functional, taken from structural data, is limited. Even with these caveats, it is clear that the 1 kG variants eschew functional residues as defined here, a trend which is surprisingly even stronger in the OMIM data.

Introduction

With the release of the 1000 Genomes Project (1 kG) data [1], it has become feasible to study human protein variation on a large scale. The main aim of the 1 kG project was to discover and characterize at least 95% of human DNA variants (with a frequency of occurrence of >1%) found in multiple human populations across the world. Five main populations were sampled with ancestry in Europe, West Africa, the Americas, East Asia and South Asia. The project has provided a rich set of synonymous (sSNPs) and non-synonymous (nsSNPs) variants for 1092 individuals from diverse populations. It is estimated from the 1 kG data that each individual will, on average, differ from the reference human genome sequence at 10,000–12,000 synonymous sites in addition to 10,000–11,000 non-synonymous sites [1]. As these nsSNPs change the amino acid sequence of the protein, the changes have the potential to affect the structure and function of the corresponding proteins. The 1000 Genomes Project data set is valuable in that it is large and not derived from a disease cohort but rather seeks to capture variants found in a disparate set of healthy individuals. This can be used to characterise differences on average between disease-associated and benign mutations (or at least mutations not known to be associated with disease) as well as exploring their structural characteristics and preferences.

The reports from the 1000 Genomes Consortium [1,2] have focused on genome and nucleotide variation, and other papers consider mutations in association with a specific disease (e.g. cancer) [3]. Various databases such as the Online database of Mendelian Inheritance in Man (OMIM, [4]), the UniProtKB human polymorphism set (Humsavar, [5]) and the Human Gene Mutation Database (HGMD, [6]) collect information on inherited diseases associated with variants. The Humsavar database contains disease-associated variants from the literature and OMIM. OMIM currently contains information on approximately 10,200 nsSNPs associated with diseases (December 2011) and Humsavar about 23,500 disease-associated nsSNPs. Most of the phenotypical effects and their molecular origins are not well established, so predicting the functional effect of a single amino acid variant is of great medical interest. The main methods assume that mutations in highly conserved residues cause disease and thus, by using alignments to homologous sequences and residue similarity, the severity of the variant can be gauged. More advanced methods include information derived from protein structures (such as solvent accessibility, free energy changes, environment specific substitution tables and functional annotations) to improve the accuracy (see review by [7]). The advantage of using a 3D approach for prediction is that the consequence and characteristics of the variant can be studied in its specific environment in the protein. This provides a level of information beyond a sequence or a sequence alignment [8]. If there are ligands present, the interaction between the mutated amino acid and the ligand can be studied. This has been successfully applied to various individual proteins on a case-by-case basis [9,10]. In total over 30 different programs to predict the effects of these variants have been published, including Condel [11], SNAP [12], SDM [13], PolyPhen [14], VEP [15], SIFT [16,17] and SNP&GO [18]. Most of these algorithms can only predict whether a specific variant will be neutral or deleterious for the protein with various degrees of accuracy, although measuring accuracy is challenging in the absence of a good benchmark.

To allow the accurate prediction of functional effects of SNPs, we need a thorough understanding of why amino acids mutate in humans. Various groups have worked on the effect of the mutations and numerous studies have been done on small specific sets of proteins [8,19–22]. Blundell and co-workers have found that the local environment around an amino acid plays a large role in the effect that selection has on a mutation in a specific position [21]. This has led to the development of environment specific substitution matrices [23,24] that incorporate structural constraints. Subramanian and Kumar [25] did a detailed analysis on a set of 8,627 disease-associated mutations and found that disease-associated mutations tend to occur on inter-species conserved residues. The common factor between these studies is that they try to understand the effect that selection and structural constraints have on disease vs non-disease states in selected sets of proteins. Very few studies have tried to unravel the underlying cause for mutation patterns seen in human proteins. With this work we aim to elucidate why certain amino acids mutate more and try to understand the underlying mechanisms present in the mutation process. We gather the data for all the amino acid mutations found in the 1000 Genomes Project to characterise their sequence and structural properties, providing a benchmark background against which to compare the disease-associated nsSNPs in OMIM and Humsavar.

Results

The 1000 Genomes Project data were queried to retrieve all the nsSNPs, which were filtered to include only those that occurred in a single population (see methods). This ensures that only the more recent mutational events in human evolution are included and simplifies counting. In addition variants at a single site were only counted once even if they occur in multiple individuals, since such clusters are assumed to represent a single variation event that has been inherited in the other individuals. For 3D analysis only human proteins, for which complete structures are available, were included to ensure accurate analysis of 3D features. For solvent accessibility calculations, a monomer subset was also generated to avoid problems with uncertain multimeric states and validate our findings on the larger dataset. Homology models based on close relatives were used to extend the data set and see if the trends observed in the experimental structures were preserved. Table 1 summarizes the five data sets created and used in this study.

Table 1. The different datasets constructed and used in this study and their composition.
Data set | Protein chains | nsSNPs | Description
1 kG | 19,058 | 106,311 | A data set containing all the 1 kG variants filtered by population.
OMIM | 19,058 | 10,151 | A protein sequence based set containing OMIM variants for all reviewed UniProt human proteins.
Humsavar | 19,058 | 23,846 | A set based on human disease polymorphisms from UniProt.
3D | 2,139 | 10,628 | A protein 3D structure based set consisting of 1 kG variants for proteins that have a complete structure in the PDB.
Monomer | 325 | 1,461 | A subset of the 3D set containing only proteins identified as being monomeric.
Model | 2,630 | 13,037 | A set based on human ModBase homology models where sequence coverage and identity are between 90–100%.
doi:10.1371/journal.pcbi.1003382.t001

Figure 1. The amino acid exchanges observed in human protein variants. The 1 kG data set is the top row of each cell and OMIM the bottom row of each cell*. Amino acids are arranged by 1 letter code according to increasing hydrophobicity (least hydrophobic is left and most hydrophobic is right) using the Fauchère and Pliska scale [58]. Yellow blocks indicate mutations where there are statistically significant differences between 1 kG and OMIM. Blue blocks indicate where no mutations were present in the 1 kG data set. White blocks show where there are no statistically significant differences. Green blocks show where there are proportionally more 1 kG mutations compared to OMIM. Orange blocks show where there are proportionally more OMIM mutations than 1 kG. The mutability scores (see methods) for the 1 kG and OMIM sets are shown in the last column. Note that these matrices are fundamentally different. The 1 kG data set gathers all the observed mutations in the 1 kG project, counting each only once; the OMIM data set combines information gathered from potentially many individuals but filtered to identify those mutations associated with a disease.
doi:10.1371/journal.pcbi.1003382.g001

The amino acid exchange matrix derived from the 1000 Genomes Project dataset

Figure 1 shows the amino acid exchange matrix generated from the ~106,000 nsSNPs found in the 1 kG data. Amino acid mutations requiring two or three base changes are not defined in this dataset due to technical reasons. The 1 kG matrix exhibits several interesting features, most of which reflect the genetic code and the differential mutability of various codons. All possible single base changes are observed. The matrix is not symmetrical as a result of the differences in frequency of occurrence of amino acids as well as differences in their mutabilities [26,27]. As expected there is a strong correlation (r = 0.786) between the frequency of occurrence of amino acids in the human proteome and the number of associated codons.

most mutable, whilst the more chemically complex amino acids, Trp (0.004) and Phe (0.005) have the lowest mutabilities. There is
Figure 2 shows that, excluding Arg no correlation in the 1000 Genomes data between mutability and and Leu which are extreme outliers, there is a strong trend for frequency of occurrence (r =20.003 excluding Arg) nor between amino acids with a higher frequency of occurrence to have more mutability and the number of codons (Figure 3). It is well known mutations (r = 0.836). Taken together this leads to a relatively that CpG dinucleotides in DNA tend to mutate at rates 10–50 strong correlation (r = 0.741) between the number of codons and times higher than other dinucleotides [28,29] and thus amino the number of mutations. In contrast, the frequency of the gained acids with a CpG present in their codons will mutate with a higher amino acids, resulting from the mutation, shows little correlation probability (see Figure 4). Four out of the six codons for Arg between frequency of occurrence and number of mutations include CpG sequences, and Arg mutates more frequently than (r = 0.349). any other residue, with a mutability (0.031) which is over twice as high as its nearest rival. This high mutability also reflects the fact Amino acid mutabilities that the CpG in the Arg codons occur in the non-wobble positions The mutabilities of the amino acids (see methods) in the 1 kG so nucleotide mutations give rise to non-synonymous SNPs. In contrast Leu which also has six codons, none of which contain dataset are shown in the last column of Figure 1. Arg (0.031) is the PLOS Computational Biology | www.ploscompbiol.org 3 December 2013 | Volume 9 | Issue 12 | e1003382 Amino Acid Mutation Characteristics Figure 4. A visual representation of the asymmetry of the 1 kG Figure 2. Comparison of the number of mutating residues vs data. The plot shows the difference between how often an amino acid the amino acid frequency of occurrence. mutates vs how often it is mutated to. These are raw counts and also doi:10.1371/journal.pcbi.1003382.g002 reflect the frequency of occurrence. 
Each amino acid is coloured according to CpG content. Red: a CpG dinucleotide occurs in its codons; CpG, has a low mutability (0.005) and mutates six times less yellow: if one of its codons start with a G (with a C possibly preceding it); blue: no CpG in its codons. The black line indicates the diagonal frequently than Arg. However the correlation with CpG is far where ‘mutations to’ equals ‘mutations from’. from perfect and other factors must have an effect. For example, doi:10.1371/journal.pcbi.1003382.g004 Met, which has only one codon with no CpG dinucleotide, is the second most mutable amino acid (0.014). Figure 4 shows the clear pattern of amino acid gain and loss in different pattern of amino acid mutabilities, compared to the the human proteome. Jordan [26] and Zuckerkandl [30] long overall trend with correlation coefficients equal to 1.0 (Figure S1). since identified that Cys, Met, His, Ser and Phe are being accrued Using the individual amino acid mutabilites, we looked at significantly in the human proteome. Our data confirm a net gain aggregate protein mutability differences by adding up the of these five amino acids, and Val, Asn, Ile and Thr were also individual mutabilities for every amino acid in each protein in confirmed as weak gainers. Jordan and co-workers also identified the data set and normalising by protein length. This was compared strong losers and our data again confirm that Pro, Ala, Gly and to the aggregate mutabilities of proteins involved in disease as Glu are strong losers. Lys was identified as a weak loser but our classified by OMIM and Humsavar. The average score for larger dataset suggests that lysine should be considered a weak disease-associated proteins was 0.0103 and for non-disease gainer in humans. Arg is the strongest loser in the human genome proteins 0.0102 with a median of 0.01022 (s = 0.0006) and (similar to the human set in [26] but not other considered species). 
0.01018 (s = 0.0005), respectively, indicating that protein aggre- We calculated the mutability for every amino acid on a gate mutability has no bearing on disease-association (Figure S2). population specific basis. None of the populations showed a The effects of physicochemical characteristics of the amino acids on their mutability As well as constraints on the mutational process at the DNA level, the consequence of a variant on the protein structure and function will also have an impact on the number of observed mutations. If a variant interferes with the structure and function of a protein and that protein is essential, then this variant is less likely to be seen. However comparison of mutability with the size and hydrophobicity of the amino acid shows very little correlation in the 1 kG dataset. There is a moderate anti-correlation between higher mutability and size (r =20.474), with the smaller amino acids mutating more frequently, but no correlation at all between mutability and hydrophobicity (r =20.082) although the large hydrophobic amino acids (Leu, Phe and Trp) have the lowest mutability scores. Trp has the fewest mutations (544, even though all SNPs in Trp codons result in a change of amino acid) and also the lowest mutability score (0.004) together with Phe. In addition to their complexity and low abundance, Phe and Trp often occur in specialized roles such as the interior of proteins, p-p stacking or ring interactions and this might add to their low mutability. The Figure 3. Amino acid mutability vs the number of codons in the mutability of Cys is also low, perhaps reflecting its role in 1 kG data. doi:10.1371/journal.pcbi.1003382.g003 disulphide bridges, which help to stabilise extracellular proteins. 
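The codon-level part of the CpG argument made above for Arg and Leu can be checked directly from the standard genetic code. The following is a minimal sketch, not the authors' pipeline: it only looks for a CG dinucleotide inside a single codon, so it ignores CpGs that span two adjacent codons (the 'yellow' class in Figure 4). The function name `cpg_codons_by_amino_acid` is an illustrative choice, and the codon table is the standard nuclear code written in the usual TCAG order.

```python
from collections import defaultdict

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def cpg_codons_by_amino_acid(table=CODON_TABLE):
    """Map each amino acid to (codons containing 'CG', all of its codons).

    Only CpGs that lie within one codon are detected; stop codons are skipped.
    """
    codons = defaultdict(list)
    for codon, aa in table.items():
        if aa != "*":
            codons[aa].append(codon)
    return {
        aa: (sorted(c for c in cs if "CG" in c), sorted(cs))
        for aa, cs in codons.items()
    }


summary = cpg_codons_by_amino_acid()
for aa in ("R", "L", "M", "W"):
    with_cpg, all_codons = summary[aa]
    print(aa, f"{len(with_cpg)}/{len(all_codons)} codons contain CpG", with_cpg)
```

Under this within-codon definition, Arg has four of six CpG-containing codons while Leu, Met and Trp have none, which matches the counts quoted in the text. It also shows why CpG content alone cannot explain the high mutability of Met.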
The structural properties of 1000 Genomes variants

To investigate the structural characteristics of these variants, three sets of protein structures were compiled, namely the 3D set, the monomer set and the model set (Table 1). The 3D and monomer set were constructed from data in the PDB (see methods) while the model set and the subsequent variant modelling was created and performed using Modbase [31] and Modeller [32], built into an in-house homology modelling pipeline. The 3D set contains 2,139 protein chains. A total of 10,628 1 kG nsSNPs were found in these chains, of which protein models, based on the known structures of human proteins could be built for 5,524. The monomer set contains 325 protein chains identified as monomers and a total of 1,461 1 kG nsSNPs were found, of which 897 could be modelled. The model set, including models based on homologues from the PDB, contained 2,630 protein chains and 12,432 out of 13,037 nsSNPs could be modelled. For the Humsavar set we found 5,592 nsSNPs of which 3,942 could be modelled.

Figure 5A shows a comparison of the solvent accessibility distribution for all residues compared to that for the variants. On average the variants in the 1 kG are slightly more exposed. An analysis of the solvent exposed residues found that, for the most accurate monomer set, 79% of nsSNPs are solvent exposed compared to 73% of all residues (p = 0.001). For the structures in the model set, 81.9% of nsSNPs were solvent exposed. For all three datasets, the 1 kG variants have a slight preference to occur on the surface of proteins compared to all residues. Figure 5B shows that there were no appreciable differences in secondary structure preferences between variants and other residues.

Figure 5. Site properties for all residues, 1 kG nsSNPs, OMIM nsSNPs and Humsavar nsSNPs in the structure 3D set. (A) the solvent accessibility for the variants in the four datasets, (B) the secondary structure in which each of the variants occurs, (C) the functional annotation of every variant in the four datasets.
doi:10.1371/journal.pcbi.1003382.g005

Do natural mutations occur in functionally annotated residues?

Functional annotation for each human protein was derived using SAS (Sequence Annotated by Structure, [33]). Table 2 shows the different functional annotations for each set. The vast majority of functional annotations identified, make contacts to ligands (using PDBsum data, [34]) or site interactions in the proteins (as defined in the PDB). Only 15.5% of the mutations (1,648 of 10,628) in the 3D set were annotated with a function compared to 29.1% of all residues in the set of human structures (Figure 5C). These data show that the observed mutations in the 1000 Genomes occur less frequently in the functionally annotated residues compared to all residues.

Table 2. The various functions assigned to nsSNPs in each set.
Set | Site | Ligand | Site/ligand overlap | Metal | Catalytic | Overall (non-redundant)
3D | 1,414 | 1,432 | 1,220 | 334 | 17 | 1,648 (15.5%)
Monomer | 281 | 273 | 245 | 83 | 4 | 312 (21.4%)
OMIM | 163 | 184 | 147 | 17 | 17 | 209 (2.1%)
Humsavar | 305 | 285 | 252 | 58 | 41 | 355 (51.2%)
Models | 1,538 | 1,443 | 1,304 | 376 | 36 | 1,676 (12.9%)
‘Site’ refers to residue specific annotations made by depositors of PDB structures, ‘Ligand’ refers to residues involved in binding a ligand, ‘Metal’ refers to residues coordinating with metals and ‘Catalytic’ to residues involved in the catalytic activity of the protein. The % of non-redundant assigned residues that are ‘functional’ is also shown.
doi:10.1371/journal.pcbi.1003382.t002

Residue conservation

Residue conservation scores, defined as the variation of the residues at a given site in the protein across multiple species, were obtained for all sites in the human proteome (where sufficient data are available) from the Evolutionary Trace server [35]. These scores are distributed across the whole range of conservation (Figure 6) with a mean score of 0.48. The scores for all the sites with mutations in the 1000 Genomes data show a slightly different distribution from all residues, with a small but significant shift (p < 2.2 × 10⁻¹⁶) towards the less conserved sites and a reduced mean conservation score of 0.43. Clearly natural variation occurs across all conservation levels and is not limited to non-conserved residues.

Figure 6. Comparison of the conservation scores in the four sets used. The density distribution of residue conservation scores for all the amino acid positions in UniProt (9,532,474 residues, black), 1 kG (185,428 residues, blue), OMIM (8,099 residues, red) and Humsavar (21,446 residues, green). The conservation scores range from 0 for non-conserved residues to 1 for highly conserved residues.
doi:10.1371/journal.pcbi.1003382.g006

Amino acid exchange characteristics in 1000 Genome data

For each amino acid the mutation profile can be calculated showing the preference for specific X => Y mutations in the 1000 Genomes data. These profiles, given for all the amino acids in Figure 7, show that there are striking differences in frequency of occurrence for the different exchanges. For example, in the 1 kG set Arg shows a strong preference to mutate to Gln and His, whilst mutations to Ser, Gly and Pro are much less frequent. All the amino acids show these differential exchange rates.

Figure 7. Comparison of the differences in observed mutations in the various sets. Comparison of the differences in the % of observed mutations in the 1 kG (blue) and OMIM (red) sets for one amino acid mutating to all others e.g. proportionally, more mutations from Lys to Glu are recorded in OMIM than in the 1 kG set. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid.
doi:10.1371/journal.pcbi.1003382.g007

Figure 8A shows the distribution of changes in energy of the whole protein caused by each mutation, evaluated as the statistical potential energy DOPE score (Discrete Optimised Protein Energy) in Modeller. 68.1% of the 1 kG variants increase the DOPE score (i.e. make the protein less stable). This implies that most natural variants decrease the stability of the protein, albeit by a very small amount. The distribution of changes in size and hydrophobicity for all observed mutations (Figure 8B and 8C) show that 59.4% of the mutations increase the hydrophobicity of the amino acid and 52.4% of mutations increase the size. Over 84% of variants change their size by less than 50 Da. 72% of variants change their hydrophobicity by less than 1 unit. Extreme changes are rare. At this stage these observations provide empirical expectation rates for amino acid exchanges in humans and result from the genetic code, the nucleotide exchange rates and also some selection at the protein level. However without a good random model it is difficult to be confident about the importance of the different contributions to such variation.

Figure 8. Comparison between the physicochemical properties of the wildtype and the mutant models for each of the data sets. Plots showing the differences between (A) Modeller DOPE scores for the wild type and mutant model (based on 3D, 10,628 mutations, and Humsavar sets, 21,446 residues), (B) changes in hydrophobicity between wild type and mutant in both sets and (C) changes in size between wild type and mutation in both sets.
doi:10.1371/journal.pcbi.1003382.g008

Comparison of 1000 Genome variants with those predicted by the PAM and WAG mutation matrices

The 1 kG counts matrix is a snapshot of mutations that have occurred in humans in a short period of time. To understand this process the count matrix can be converted into an instantaneous rate matrix describing the rates of change of each amino acid in humans in a time-independent manner [36]. Instantaneous rate matrices have previously been built from a wide selection of protein alignments across many species including nuclear proteins, mitochondrial proteins, chloroplast proteins, buried protein domains and exposed protein domains. PCA can be used to compare these inter-species matrices with the 1 kG intra-species matrix (Figure 9A–C). The 1 kG matrix was built using data where the direction of the mutations is known whereas all other matrices were calculated assuming direction is unknown. This was compared to the WAG [37] and PAM matrix [38]. To check that any differences between the 1 kG matrix and the other matrices are not caused by using direction, a directionless matrix has also been included in the plot (Figure 9D). In this plot, principal component one clearly separates the 1 kG matrices, which are placed very close together, from all of the previously calculated matrices. Principal component two then spreads matrices out based on whether the alignments used to build them are made up mainly of exposed or buried domains, with the mitochondrial matrices at the one extreme built from nearly all membrane proteins, and matrices built from only exposed regions of proteins at the other.

A difference between the intra-species data and the inter-species matrices is the amount of selection which has occurred. Due to the time-scale for the 1 kG data and the relatively weak selection in human populations [39,40] the only mutations which are not observed are lethal mutations. This means that there should be a limited effect of selection on the 1 kG matrix. By using no allele frequency cutoff for the minor alleles when building the count matrix, we gather the maximum amount of information about the mutation process. The counts are necessarily shaped by mutation and selection but will mostly reflect the mutation process. The inter-species matrices (e.g. PAM and WAG in Figure 9B,C) on the other hand are subject to selection pressures. This could explain why the 1 kG matrix is so different from the other matrices. One clear factor is CpG hypermutability: for example, changes from Arg, an amino acid with four of six codons containing a CpG, have a very high rate in the 1 kG data, and not in WAG (Figure 9A,B). In fact only codons containing a CpG have high rates overall (Figure 10). The most plausible explanation is that these CpG mutations are occurring at a very high rate and then are selected out so that the effect is not seen as strongly when looking across multiple species.

Figure 9. Bubble plots comparing the relative differences between the instantaneous rate change matrices of the data sets. (A) 1 kG data, (B) PAM matrix and (C) WAG matrix. (D) A PCA (first two components) plot showing the separation of the 1 kG matrices from other matrices. Matrices included are 1 kG (with and without assuming direction), nuclear (WAG, JTT, LG, PAM, tm126, PCMA), mitochondrial (mtREV24, mtMam, mtArt, mtZoa), chloroplast (cpREV, cpREV64), exposed (alpha helix, beta sheet, coil, turn) and buried (alpha helix, beta sheet, coil, turn). Principal components one and two represent 34% and 20% of the variance, respectively. All other principal components represent 9% or less of the variance each. Amino acids are arranged according to increasing hydrophobicity.
doi:10.1371/journal.pcbi.1003382.g009

Comparison between the 1000 Genomes variants and the disease-associated variants

For comparison, we have constructed the amino acid exchange counts matrix for data from the OMIM database and the associated plots for these mutations (Figures 1–8). Disease variants from the UniProtKB/Swiss-Prot Human polymorphisms and disease mutations index (Humsavar) were also included with plots available in the supplement (Figures S3, S4, S5). Our focus however is on the OMIM set. In contrast to the 1 kG data, various double and triple base mutations are observed in the OMIM set, however the three triple base changes (Phe-Lys, Met-Tyr and Trp-Ile) were checked back to the publications and all were found to be errors either in the paper or in OMIM and were removed. 82 two base changes were found in OMIM and a few (10%) randomly selected changes were manually checked with no errors found. Clearly the OMIM data are radically different from the 1000 Genome data, in that they are all independent observations of variable confidence and manually determined by individual scientists. They only represent a small fraction of disease-associated nsSNPs and the number of mutations (~10,000), is approximately ten times smaller than the number of 1000 Genomes mutations. The normalised OMIM counts that differ from the 1 kG dataset are coloured in Figure 1. Considering just the residue type, if we exclude Arg, the overall correlation between the normalised frequencies of occurrence of the mutated residues in the two datasets is only 0.14 and between 1 kG and Humsavar it is 0.48. If we compare all 148 observed X => Y frequencies, the correlation between 1 kG and OMIM is 0.51 and 1 kG and Humsavar is 0.79.

Previous studies have found that mutations from Arg and Gly are the major contributors to human genetic disease and have been shown to make up about 30% of the mutations involved in disease [41]. In this updated and much expanded set, variants from Arg and Gly only make up 15% of the disease causing mutations. However mutations to Arg are still the biggest contributor to genetic disease with ~19.4% of all mutations. Figure 11 shows a rank order comparison between the frequency of occurrence of the 1 kG and OMIM variants (r = 0.09) as well as between 1 kG and Humsavar (r = 0.31) and Humsavar and OMIM (r = 0.51), normalised for amino acid occurrence. Unlike for the 1 kG data, the disease-associated variants show moderate inverse correlations between their frequency and the frequency of occurrence of the residue type (r = -0.67) implying that, at least for OMIM, the mutations to the rarer amino acids (with fewer codons) are more likely to be associated with disease. As with the 1 kG data there is no strong correlation between a residue type being associated with a disease in the OMIM data and the number of codons. For hydrophobicity and size, the disease associated variants show the opposite trend to the 1 kG dataset with a moderate correlation between lower frequency and smaller size (r = 0.528, excluding Cys and Trp) but no correlation between frequency and hydrophobicity (r = 0.289).

It is interesting to note that the least mutable amino acid in the 1 kG data (Trp) turns out to be the residue whose mutation is most likely to result in disease in the OMIM variants and is highly ranked in the Humsavar set. Trp, the largest amino acid, often occurs in specialized roles in proteins as does Cys, the second most frequent variant residue type in OMIM.
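The consistency check described above for the OMIM exchanges (single, double or triple base changes) can be reproduced from the genetic code alone. The sketch below is an illustration rather than the authors' procedure: for a pair of amino acids it takes every pair of their codons and reports the smallest number of differing positions. The helper names `codons_for` and `min_base_changes` are hypothetical, and amino acids are given in one-letter code.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codons_for(amino_acid):
    """Return all codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_base_changes(aa_from, aa_to):
    """Smallest number of nucleotide substitutions turning aa_from into aa_to."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1, c2 in product(codons_for(aa_from), codons_for(aa_to))
    )


# The three exchanges reported in the text as triple base changes,
# followed by two single-base exchanges from Arg for comparison.
for pair in (("F", "K"), ("M", "Y"), ("W", "I"), ("R", "Q"), ("R", "G")):
    print(f"{pair[0]} => {pair[1]}: {min_base_changes(*pair)} base change(s)")
```

For Phe-Lys, Met-Tyr and Trp-Ile every codon pair differs at all three positions, which is consistent with the text treating them as triple base changes that warranted checking against the original publications.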
Amino acids with a lower frequency of occurrence tend to be the more complex amino acids and are frequently found in specialized roles. Mutating them will result in the possible loss or alteration of protein function, hence the over-representation in OMIM and Humsavar. In a number of cases the OMIM and 1 kG variant preferences appear to behave in an opposite way from one another e.g. in Figure 7 Arg most frequently mutates to Gln in the 1000 Genomes and a variation to Gly is much less common, whilst Arg to Gly is the most common variant in the OMIM dataset and a variation to Gln is rare.

We observe a reasonable correlation between the OMIM and Humsavar mutabilities (r = 0.51), but some amino acids appear to behave completely differently in the two datasets. Gly and Ala are much more frequently mutated in the Humsavar set than in OMIM, whilst Gln, Lys and His have mutabilities in the Humsavar set similar to those observed in the 1 kG dataset and much smaller than those in OMIM. This may reflect the larger Humsavar dataset (but this seems unlikely since Gly and Ala are quite common amino acids), so these specific discrepancies may rather reflect the origins of mutations in the two separate datasets.

Figure 10. Dependence of mutation rates on the change in CpG status. Rates of change from codons were calculated similarly to the amino acid rate matrix [36], but on a 61 by 61 codon matrix.
doi:10.1371/journal.pcbi.1003382.g010

Figure 11. Amino acid mutability rank order plot comparing the mutability scores for 1 kG, OMIM and Humsavar residues. The most mutable amino acids are at the top. Correlation coefficients for 1 kG vs OMIM, 1 kG vs Humsavar and OMIM vs Humsavar are 0.09, 0.17 and 0.51, respectively.
doi:10.1371/journal.pcbi.1003382.g011

Structural properties of disease-associated nsSNPs

The disease-associated OMIM variants show a slight preference for buried sites (33%) compared to all residues (27%) in the human proteome (Figure 5A), which is even stronger in the Humsavar data (41%). This contrasts with the ‘natural’ variants of the 1 kG data, which show a decreased preference (18%) for the interior. Our work broadly agrees with a smaller study done by Gong and Blundell [21] that showed 60–65% of disease associated nsSNPs are solvent exposed. We found an almost identical distribution of OMIM and Humsavar variants compared to all residues and the 1 kG variants between the different secondary structures (Figure 5B).

Figure 8A shows the differences in the DOPE scores [42] calculated for each variant during the structural modelling process for the 1 kG, OMIM and Humsavar datasets. The distribution for the disease-associated variants is shifted towards larger positive energies in both datasets, indicating that the variants destabilize the protein slightly more than the non-disease variants. In contrast to the 1 kG data, OMIM mutations are more likely to increase the polarity (54%) and more likely to decrease size (51.6%, Figure 8B,C). The two datasets show some detailed differences in size and hydrophobicity changes. The Humsavar variants less frequently reduce size or decrease hydrophobicity compared to OMIM mutations.

Functional annotations

In the OMIM set, 11.2% (209 of 1,864) of the modelled mutations were annotated with a function (Figure 5C and methods). This is less than the distribution for all residues (29.1%) and that seen for the 1 kG variants (15.5%). For the Humsavar data this drops to only 6.5%. This is a surprising finding, which needs further validation. It implies that most disease-associated mutations do not have a direct effect on the proteins’ catalytic or binding sites but instead act through other, unannotated residues such as those which affect overall structure and stability or are involved in as yet unidentified protein-protein interfaces.

Conservation

There is a clear difference in the conservation score distribution between natural variants and the OMIM and Humsavar variants (Figure 6). The natural variants occur across the entire range of conservation but the OMIM and Humsavar variants show a peak in the more conserved residues. This is consistent with the idea that mutations in conserved residues often lead to disease.

Discussion

The results presented herein are subject to a few caveats, the most serious being related to the limited and possibly biased disease-associated data in OMIM. There are only ~10,000 variants in our OMIM set and these have variable experimental validation, and may indeed be biased according to scientists’ preconceptions that such mutations should correspond to the residues that are most conserved and the amino acid exchanges that generate the largest changes in physicochemical characteristics. The Humsavar set has over 23,000 disease variants, however the requirements for inclusion are based on an annotation of ‘involvement in disease’. This annotation is derived from either OMIM annotations or associations found in literature during curation of the SwissProt data. Notwithstanding, the OMIM dataset is one of the best available at the present time, although the coming years will see major expansion and hopefully improvements in such data. The results highlight the complex interplay of features from the level of the DNA up to protein sequence and structure. The codon CpG dinucleotide content plays a large role in determining which amino acids mutate. This in turn affects the mutability of amino acids and a clear difference was seen between non-disease and disease variants where amino acids that are naturally very mutable show the opposite trend in the disease-associated data.

The data for the 1000 Genomes provides a new experimental baseline against which amino acid profiles may be compared. Although there might be sequencing biases due to the DNA sequencing technologies used [43], every effort has been made by the 1000 Genomes consortium to correct for this. They estimate that using consensus calling on data produced by multiple platforms results in an error rate of 1–4%, which should have only a small impact on our results. The current results show evidence for some protein selection, mainly in that the variants occur slightly more often on the surface of the protein and are much less likely to be annotated as functional than expected by chance. However, we should note that even the best definition of functional, taken from structural data, is limited. At one level, the definition is rather broad. For example, all residues in contact with a ligand are described as functional, but this is a major underestimate since many cognate ligands are not present in the crystal structures and similarly protein-protein interactions are rarely captured. In addition there are still relatively few complete structures for human proteins, which makes analysis of the effects of variants more difficult.

Even with these caveats, it is clear that the 1 kG variants eschew functional residues as defined here, a trend which is surprisingly even stronger in the OMIM and Humsavar data. The preference for OMIM mutations to be more buried and less functional supports the suggestion that these variants predominantly affect the structure and stability of the protein [4]. This is a similar result to that found by Sunyaev and co-workers [44] on a much smaller set. They found that 35% of disease variants were buried and a more detailed analysis found that ~70% of the variants are located in structurally and functionally important regions. Therefore these disease-associated mutations may well target residues that are remote from the active site, which modulate rather than obliterate the function of the protein. For example, for an enzyme, the primary catalytic residues are rarely targeted, but the ‘secondary’ residues in the interior (which affect stability) or on the surface, which may affect protein-protein interactions, could modulate function. However, the higher than average conservation scores for OMIM and Humsavar sites suggest that these disease-associated residues, although not defined as ‘functional’, are still important for the organism. This needs further investigation, with particular attention to how ‘functional’ residues are defined and whether we can improve on this definition.

There is a small but significant impact of the protein structure on amino acid mutability, so that natural variants occur slightly more often in non-conserved regions. 59.4% of variations increase the hydrophobicity of the amino acid and 52.4% increase its size in the natural set, while OMIM variants often result in larger changes in the size and hydrophobicity of the amino acid and are more destabilising on average than 1 kG variants. The Humsavar data supports this idea that disease variants result in more extreme changes. The selection pressures captured in the WAG and PAM matrices ‘purify’ out the ‘natural’ variants, removing variants with large changes in size and hydrophobicity. The amino acids all show distinctive exchange profiles, whereby some exchanges are very common and some very rare, which provides an empirical expectation for any specific exchange in humans.

As the cost of sequencing drops rapidly, many more genomes will be sequenced and experimental validation of disease-causing mutations will improve as a result of more data. Much better codon-based models of evolution will be attainable, allowing in turn a better dissection of the impact of selection at the protein level. The data herein will be used to develop an improved method to predict the effects of individual mutations, to explore cancer-related amino acid mutations, to investigate and compare mutational profiles in different organisms as well as improving codon mutation models for human DNA.

Methods

Non-synonymous mutations in humans

UniProt [5] was queried for all reviewed protein sequences belonging to Homo sapiens. 19,058 entries were retrieved. The Ensembl transcript ID [45] was obtained for each protein sequence using the mapping provided by UniProt (17,708 UniProt entries were mapped to 40,351 Ensembl transcript IDs). Immunoglobulins and major histocompatibility complex proteins were excluded as they are inherently variable. For every protein, the Ensembl v67 Perl API was used to query the transcript ID in Ensembl for nsSNPs found in the 1 kG data set (as available on 1 August 2012). To reduce the inherent uncertainty involved in determining the ancestral allele, only mutations that occurred in one of the 1000 Genomes described populations were used, with the allele present in all populations considered the ancestral, hence defining the direction of the mutation. This increases the chances that the variant found in the 1 kG data is a mutation away from the ancestral genome. 106,311 mutations were found and this data set, containing the ‘natural’ variants found in the 1 kG project, will be referred to as the 1 kG set.

Residue conservation scores for each residue in every protein sequence were calculated using the Evolutionary Trace server [35].
Conservation scores for 2,274 sequences could not be calculated due to the methodology used by the Evolutionary Trace Bringing together all the above observations for disease- associated and natural variants in 1000 humans, we observe server that disregards residues in columns of the multiple alignment containing more than 60% gaps and ranked as being that the mutability of amino acids is largely driven by the properties of the DNA and mutational mechanisms, which favour non-conserved, as well as residues judged by the algorithm not to mutations at codons containing a CpG dinucleotide. Therefore have enough information. This process almost certainly preferen- mutations to Arg residues are more than twice as common as any tially excludes surface residues (where insertions and deletions are other mutation. However there are clearly other factors at play, most common) but since we are using the conservation distribution which determine the frequency of variants, even at the DNA level. for comparisons, this bias is not significant. The UniProt sequences Although the disease-associated variants (both OMIM and were used to calculate the relative abundance of amino acids in Humsavar) follow the same pattern as the 1 kG variants (i.e. the human proteins. A total of about 10.5 million amino acids were same mutations are present in both sets, as dictated by the genetic counted. For each protein sequence, the OMIM Mutations search code), the rank order of amino acids, according to their probability tool (http://www.bioinf.org.uk/omim) was queried with the of being disease-associated, is radically different from that UniProt entry ID to retrieve variants found in OMIM. Only expected on the basis of the 1 kG data, with some of the rarer variants for which the correct amino acid position in the protein amino acids being shifted to the top of the list. 
has been verified, were used for the OMIM data set and will be referred to as the OMIM set. 556 of the OMIM mutations were found in the 1 kG set (0.5%). Although these represent a very small fraction we removed them so that they did not bias the results.

The instantaneous rate change matrices were derived using the DCFreq method [36] and the human proteome frequencies.

Mutability of amino acids

A mutability score for every amino acid was calculated by taking the total number of mutations for a specific amino acid in the data and dividing by the frequency of occurrence for the specific amino acid in the human genome. The proportional representation of each amino acid in the human proteome is given in supplemental Table S1.

Statistical validation

We compared the amino acid variant counts in the 1 kG and OMIM data using Fisher's exact test in R (R Development Core Team, 2011). Multiple comparison correction was done on the p-values for each amino acid using p.adjust in R with the Benjamini-Hochberg-Yekutieli method [46,47]. P-values lower than 0.01 were considered statistically significant. For correlation values, r > 0.7 and r < −0.7 were considered strong, 0.4 < r < 0.7 and −0.4 > r > −0.7 were considered moderate and 0.3 > r > −0.3 weak or no correlation.

Retrieving human proteins and their structures

The protein structure data set was constructed by first taking all the above mentioned protein sequences and annotating each with their respective Pfam [48] domains. Only proteins for which there were matching entries in the Protein Data Bank (PDB, [49]) were kept. This resulted in a list containing the UniProt identifiers for all known human proteins that have at least one structure in the PDB. For accuracy, the corresponding PDB structures were then filtered to include only X-ray structures. Using the Pfam mapping, only protein structures containing all the protein's Pfam domains were kept. The final list contained 2,139 protein chains and will be referred to as the 3D set.

A set consisting only of human monomeric proteins was also constructed. An algorithm was implemented whereby a protein was classified as being either a multimer or a monomer based on a majority vote. The predictions used were from PISA [50], UniProt, 3DComplex [51], PiQSi [52], PQS-PITA [53–55], relevant PubMed abstracts and REMARK 350 records from the PDB structure file. The oligomeric predictions from each of the servers were collected for every protein in the 3D set. Only when the majority of the servers agreed on the most probable oligomeric state of the protein was it designated as either a multimer or a monomer. The monomeric protein list contained 325 proteins and will be referred to as the monomer set.

Another homology-based set was constructed using the human models in ModBase [31]. Models with 90–100% sequence identity and coverage were used as templates. This set contained 2,630 models and will be referred to as the model set.

Protein chain annotation

Each protein chain in the 3D, monomer and model sets was annotated with information from various databases and online resources. Information about protein properties such as catalytic residues, metal-binding residues, ligand-binding residues and PROSITE patterns [56] was extracted from PDBsum [34] and additional functional residue annotations were retrieved using SAS (Sequence Annotated by Structure, [33]). The 3D coordinates for each of the proteins in the structure data sets were retrieved from the PDB. To maintain consistency between the PDB and UniProt residue numbering, the SIFTS mapping [57] for each protein chain was used. NACCESS was used to calculate the relative solvent accessibilities for the individual residues in a chain. A cut-off of 5% solvent exposure was used to distinguish between buried and exposed residues.

Mapping nsSNPs to structures

To investigate the effect a nsSNP might have, each individual nsSNP was mapped to its correct amino acid in the protein structure. For every such nsSNP that could be mapped, a homology model of the protein containing the nsSNP was built using Modeller 9v3 [32] with the original protein structure serving as the template. A maximum of 200 steps of conjugate gradient minimization followed by 200 rounds of molecular dynamics at 300 K (using Modeller) was applied to each variant and its structural context analysed. NACCESS was run on all the variant models to identify changes in solvent accessibility. Comparisons of the Modeller DOPE score (Discrete Optimized Protein Energy, [42]) were made between the nsSNP model and the reference structure to estimate the magnitude of change that a variant might cause. The 1 kG models are available in PDBsum (http://www.ebi.ac.uk/pdbsum/) by looking at the specific PDB code of interest.

Supporting Information

Figure S1 Mutabilities of the amino acids for each population. AMR: American admixed, ASN: South East Asian, AFR: African, EUR: European. (EPS)

Figure S2 The distribution of average protein mutabilities for all human proteins (blue) and disease associated proteins (red). (EPS)

Figure S3 The amino acid exchanges observed in human protein variants. The 1 kG data set is the top row of each cell and Humsavar (SP) the bottom row of each cell*. Amino acids are arranged by 1 letter code according to increasing hydrophobicity (least hydrophobic is left and most hydrophobic is right) using the Fauchère and Pliska scale. Yellow blocks indicate mutations where there are statistically significant differences between 1 kG and Humsavar. Blue blocks indicate where no mutations were present in the 1 kG data set. White blocks show where there are no statistically significant differences. Green blocks show where there are proportionally more 1 kG mutations compared to Humsavar. Orange blocks show where there are proportionally more Humsavar mutations than 1 kG. The mutability scores (see methods) for the 1 kG and Humsavar sets are shown in the last column. *Note that these matrices are fundamentally different. The 1 kG data set gathers all the observed mutations in the 1 kG project, counting each only once; the Humsavar data set combines information gathered from potentially many individuals but filtered to identify those mutations associated with a disease. (EPS)

Figure S4 Comparison of the differences in observed mutations in the various sets. Comparison of the differences in the % of observed mutations in the 1 kG (blue) and Humsavar (red) sets for one amino acid mutating to all others e.g. proportionally, more mutations from Lys to Glu are recorded in Humsavar than in the 1 kG set. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid. (EPS)

Figure S5 Comparison of the differences in observed mutations in the various sets. Comparison of the differences in the % of observed mutations in the Humsavar (green) and OMIM (red) sets for one amino acid mutating to all others. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid. (EPS)

Table S1 The relative abundances of the various amino acids in the UniProt protein set. (PDF)

Acknowledgments

We would like to thank Angela Wilkins for running the large scale conservation analysis, Grecia Lapizco-Encinas for constructing the monomer set, Arjun Ray for doing the analysis of the ModBase models and Ewan Birney for valuable discussions.

Author Contributions

Conceived and designed the experiments: TAPdB RAL SLP BS NG JMT. Performed the experiments: TAPdB RAL. Analyzed the data: TAPdB SLP BS. Wrote the paper: TAPdB RAL JMT. Valuable discussion regarding the method design: NG JMT SLP BS.

References

1. 1000 Genomes Project Consortium, Durbin RM, Abecasis GR, Altshuler DL, Auton A, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
2. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
3. Iengar P (2012) An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Res 40: 6401–6413.
4. Amberger J, Bocchini CA, Scott AF, Hamosh A (2009) McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res 37: D793–D796.
5. UniProt-Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–D148.
6. Stenson PD, Ball E, Howells K, Phillips A, Mort M, et al. (2008) Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45: 124–126.
7. Ng PC, Henikoff S (2006) Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7: 61–80.
8. Steward RE, MacArthur MW, Laskowski RA, Thornton JM (2003) Molecular basis of inherited diseases: a structural perspective. Trends Genet 19: 505–513.
9. Fabre KM, Ramaiah L, Dregalla RC, Desaintes C, Weil MM, et al. (2011) Murine Prkdc polymorphisms impact DNA-PKcs function. Radiat Res 175: 493–500.
10. Minutolo C, Nadra AD, Fernández C, Taboas M, Buzzalino N, et al. (2011) Structure-based analysis of five novel disease-causing mutations in 21-hydroxylase-deficient patients. PLoS One 6: e15899.
11. González-Pérez A, López-Bigas N (2011) Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 88: 440–449.
12. Bromberg Y, Yachdav G, Rost B (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24: 2397–2398.
13. Worth CL, Preissner R, Blundell TL (2011) Sdm–a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res 39: W215–W222.
14. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, et al. (2010) A method and server for predicting damaging missense mutations. Nat Methods 7: 248–249.
15. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, et al. (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26: 2069–2070.
16. Ng PC, Henikoff S (2001) Predicting deleterious amino acid substitutions. Genome Res 11: 863–874.
17. Ng PC, Henikoff S (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31: 3812–3814.
18. Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30: 1237–1244.
19. Nakken S, Alseth I, Rognes T (2007) Computational prediction of the effects of non-synonymous single nucleotide polymorphisms in human DNA repair genes. Neuroscience 145: 1273–1279.
20. Reumers J, Schymkowitz J, Rousseau F (2009) Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations. BMC Bioinformatics 10 Suppl 8: S9.
21. Gong S, Blundell TL (2010) Structural and functional restraints on the occurrence of single amino acid variations in human proteins. PLoS One 5: e9186.
22. Kamaraj B, Purohit R (2013) Computational screening of disease-associated mutations in OCA2 gene. Cell Biochem Biophys: 1–13.
23. Gong S, Worth CL, Bickerton GRJ, Lee S, Tanramluk D, et al. (2009) Structural and functional restraints in the evolution of protein families and superfamilies. Biochem Soc Trans 37: 727–733.
24. Worth CL, Gong S, Blundell TL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10: 709–720.
25. Subramanian S, Kumar S (2006) Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome. BMC Genomics 7: 306.
26. Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, et al. (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638.
27. Hurst LD, Feil EJ, Rocha EPC (2006) Protein evolution: causes of trends in amino-acid gain and loss. Nature 442: E11–2; discussion E12.
28. Walser JC, Furano AV (2010) The mutational spectrum of non-CpG DNA varies with CpG content. Genome Res 20: 875–882.
29. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, et al. (2012) Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471–475.
30. Zuckerkandl E, Derancourt J, Vogel H (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59: 473–
31. Pieper U, Webb BM, Barkan DT, Schneidman-Duhovny D, Schlessinger A, et al. (2011) ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39: D465–D474.
32. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234: 779–815.
33. Milburn D, Laskowski RA, Thornton JM (1998) Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 11: 855–859.
34. Laskowski RA (2009) PDBsum new things. Nucleic Acids Res 37: D355–D359.
35. Mihalek I, Res I, Lichtarge O (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol 336: 1265–1282.
36. Kosiol C, Goldman N (2005) Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193–199.
37. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18: 691–699.
38. Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(3): 345–351.
39. Lohmueller KE, Indap AR, Schmidt S, Boyko AR, Hernandez RD, et al. (2008) Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.
40. Akashi H, Osada N, Ohta T (2012) Weak selection and protein evolution. Genetics 192: 15–31.
41. Vitkup D, Sander C, Church GM (2003) The amino-acid mutational spectrum of human genetic disease. Genome Biol 4: R72.
42. Shen MY, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15: 2507–2524.
43. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, et al. (2013) Characterizing and measuring bias in sequence data. Genome Biol 14: R51.
44. Sunyaev S, Ramensky V, Bork P (2000) Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet 16: 198–200.
45. Flicek P, Aken BL, Ballester B, Beal K, Bragin E, et al. (2010) Ensembl's 10th year. Nucleic Acids Res 38: D557–D562.
46. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Statistical Society 57(1): 289–300.
47. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4): 1165–1188.
48. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families database. Nucleic Acids Res 38: D211–D222.
49. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
50. Krissinel E, Henrick K (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372: 774–797.
51. Levy ED, Pereira-Leal JB, Chothia C, Teichmann SA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2: e155.
52. Levy ED (2007) PiQSi: protein quaternary structure investigation. Structure 15: 1364–1367.
53. Ponstingl H, Henrick K, Thornton JM (2000) Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins 41: 47–57.
54. Henrick K, Thornton JM (1998) PQS: a protein quaternary structure file server. Trends Biochem Sci 23: 358–361.
55. Ponstingl H, Kabir T, Gorse D, Thornton JM (2005) Morphological aspects of oligomeric protein structures. Prog Biophys Mol Biol 89: 9–35.
56. Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, et al. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3: 265–274.
57. Velankar S, McNeil P, Mittard-Runte V, Suarez A, Barrell D, et al. (2005) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res 33: D262–D265.
58. Fauchère JL, Charton M, Kier LB, Verloop A, Pliska V (1988) Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Pept Protein Res 32: 269–278.
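
Illustrative sketch of the Methods calculations

The calculations described under "Mutability of amino acids" and "Statistical validation" in the Methods can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' code: the original analysis was run in R, and every function name below is hypothetical. Two details are assumptions made for the sketch: the layout of the 2x2 table passed to Fisher's exact test (one amino acid versus all others, in one set versus the other), and the reading of the "Benjamini-Hochberg-Yekutieli method" as the Benjamini-Yekutieli step-up adjustment.

```python
from math import comb


def mutability_scores(mutation_counts, abundance):
    """Observed mutations of each amino acid divided by its relative abundance."""
    return {aa: mutation_counts[aa] / abundance[aa] for aa in mutation_counts}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is no more probable than the observed one.
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    n = len(pvalues)
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, harmonic * n / rank * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted


def compare_sets(counts_a, counts_b):
    """Adjusted p-value per amino acid for its share of variants in two sets.

    Assumed table for each amino acid (hypothetical layout):
    [[count in set A, all other variants in A],
     [count in set B, all other variants in B]]
    """
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    amino_acids = sorted(counts_a)
    raw = [fisher_exact_two_sided(counts_a[aa], total_a - counts_a[aa],
                                  counts_b.get(aa, 0),
                                  total_b - counts_b.get(aa, 0))
           for aa in amino_acids]
    return dict(zip(amino_acids, benjamini_yekutieli(raw)))
```

With per-residue variant counts for two sets, `compare_sets(counts_a, counts_b)` returns one adjusted p-value per amino acid, which would then be compared against the 0.01 threshold stated in the Methods. The exact enumeration is kept for readability; for counts of the size reported in the paper, a library implementation of Fisher's exact test would likely be faster.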

Journal

PLoS Computational Biology – Public Library of Science (PLoS) Journal

Published: Dec 12, 2013
