A systematic evaluation of normalization methods in quantitative label-free proteomics

To date, mass spectrometry (MS) data remain inherently biased for reasons ranging from sample handling to differences caused by the instrumentation. Normalization is the process that aims to account for the bias and make samples more comparable. The selection of a proper normalization method is a pivotal task for the reliability of the downstream analysis and results. Many normalization methods commonly used in proteomics have been adapted from the DNA microarray techniques. Previous studies comparing normalization methods in proteomics have focused mainly on intragroup variation. In this study, several popular and widely used normalization methods representing different strategies in normalization are evaluated using three spike-in and one experimental mouse label-free proteomic data sets. The normalization methods are evaluated in terms of their ability to reduce variation between technical replicates, their effect on differential expression analysis and their effect on the estimation of logarithmic fold changes. Additionally, we examined whether normalizing the whole data globally or in segments for the differential expression analysis has an effect on the performance of the normalization methods. We found that variance stabilization normalization (Vsn) reduced variation the most between technical replicates in all examined data sets. Vsn also performed consistently well in the differential expression analysis. Linear regression normalization and local regression normalization also performed systematically well. Finally, we discuss the choice of a normalization method and some qualities of a suitable normalization method in the light of the results of our evaluation.
Key words: proteomics; normalization; label free; bias; differential expression; logarithmic fold change; quantitation; intragroup variation; reproducibility; mass spectrometry

Introduction

The development of mass spectrometry (MS)-based proteomics has been rapid. Modern proteomics aims not only to identify the proteins but also to quantify them as accurately as possible [1]. Current MS-based proteomics workflows are able to detect thousands of proteins, their modifications and localizations in a single run [2]. Despite all the developments of MS technologies, the data from the MS analysis are still susceptible to systematic biases [3]. This bias has been defined as variation caused by nonbiological sources, which is introduced by small variations in the experimental conditions in the course of carrying out the MS analysis [4]. These variations include, for example, differences in sample preparation and handling, device calibration or changes in temperature, but the exact reason of the bias is usually unknown and cannot thus be solely accounted for by adjusting the experimental settings [3, 4]. The observed bias can be independent or dependent on the measured protein abundances [4].

Tommi Valikangas is a Research Scientist in the Computational Biomedicine Group at the Turku Centre for Biotechnology Finland. He is interested in computational biology and bioinformatics.

Tomi Suomi is a Research Scientist in the Computational Biomedicine research group at the Turku Centre for Biotechnology Finland. His research interests include scientific computing and bioinformatics.

Laura L. Elo is Adjunct Professor in Biomathematics, Research Director in Bioinformatics and Group Leader in Computational Biomedicine at Turku Centre for Biotechnology, University of Turku, Finland. Her main research interests include computational biomedicine and bioinformatics.

Submitted: 9 June 2016; Received (in revised form): 6 September 2016

© The Author 2016.
Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022

The process that aims to take the bias into account is called normalization. Normalization aims to make the samples of the data more comparable and the following downstream analysis reliable [3]. Many of the normalization methods used for proteomics data have their roots in the DNA microarray technology [4], where several evaluations and reviews have already elucidated their performance [5–8]. For instance, Bolstad et al. [5] compared five normalization methods with DNA microarray data and concluded that most of them performed rather similarly and reduced nonbiological variability across arrays when compared with the unnormalized data. Choe et al. [6] also found no significant differences between the four normalization methods they examined with an RNA spike-in experiment at the probe level. In previous comparisons in proteomics, Callister et al. [9] used three different liquid chromatography–MS (LC-MS) data sets to evaluate four different normalization methods on peptide level and found a linear regression normalization best suited for their data sets. Kultima et al. [10] compared 10 different normalization methods with three different peptidomics data sets and noticed that the order of the LC-MS experiments affected the bias in the data; they suggested that their novel RegrRun normalization, which combines linear regression normalization with analysis order normalization, was the best overall method in reducing unwanted intragroup and intrasample variation.

Different tools for helping in the selection of a normalization method have also been proposed. Webb-Robertson et al. [11] stated that a single method cannot account for the bias in different data sets; rather it is crucial for reliable downstream analysis to select the appropriate normalization method for each data set. They introduced a tool called SPANS, which combines eight methods for peptide selection to be used in normalization with five normalization methods [11]. Chawade et al. [3] also introduced a tool for choosing a proper normalization method called Normalyzer. Their tool includes several popular normalization methods such as linear regression, local regression, total intensity, average intensity, median intensity, variance stabilization normalization (Vsn) and quantile normalization, together with several frequently used evaluation measures used to assess the performance of a normalization method such as the pooled coefficient of variation (PCV), the pooled median absolute deviation (PMAD) and the pooled estimate of variance (PEV) [3].

So far, comparisons of normalization methods in proteomics have typically focused on their ability to decrease intragroup variation between technical and/or biological replicates of the test data. Measures for the intragroup variation such as PEV [3, 9, 10], PCV [3], PMAD [3], the median coefficient of variation (CV) [9] and the median SD [10] have been used to rank the normalization methods compared. While reducing intragroup variation is certainly a central goal of normalization, a more thorough comparison of the normalization methods and their performance in proteomics is still lacking. Although interesting questions such as differences in the correct detection of truly differentially expressed proteins in the data normalized by different normalization methods have been investigated before [3, 12, 13], a thorough systematic analysis using multiple data sets and two-group comparisons has not been available in proteomics. Also, the effect of the normalization method on the estimation of the logarithmic fold change (logFC) or the effect of how the normalization is performed when comparing only two sample groups from a larger data set has not been systematically investigated before.

To address this need, we conducted an extensive comparison of 11 popular normalization methods or their variants. Other normalization approaches not covered in this study exist, such as the MaxLFQ integrated into the MaxQuant software [13] and the normalization integrated into the DeMix-Q software [14]. These normalizations, however, are integral parts of proteomics software workflows as opposed to the stand-alone normalization methods examined in this comparison, with the exception of Progenesis normalization. All the normalization methods examined are commonly used methods in proteomics and have different approaches and assumptions regarding the bias occurring in the data. Three spike-in label-free proteomics data sets were used for benchmarking the normalization methods. The spike-in data sets are suitable for this kind of method testing, as the differences between sample groups are known, and methods can be evaluated in their ability to find the true differences and to level out other biologically nonexisting differences. Additionally, a data set from a mouse study was also used to compare the performance of the normalization methods in a non-spike-in data set, representing a typical real research setting. Offline fractionation, which adds another layer of complexity to normalization, was not used in any of the tested data sets. In such cases, the total peptide ion signals of each fraction are spread over several runs, which should be normalized before summing up the values [12].

Materials and methods

Description of the data sets

The UPS1 data set

Benchmarking data of Pursiheimo et al. [15] include Universal Proteomics Standard Set (UPS1) proteins spiked into a yeast proteome digest to create concentrations of 2, 4, 10, 25 and 50 fmol/μl. Three technical replicates of each concentration were analyzed using LTQ Orbitrap Velos mass spectrometer. The spike-in data are available from the PRIDE Archive with the identifier PXD002099 (http://www.ebi.ac.uk/pride/archive/projects/PXD002099).

The CPTAC data set

The CPTAC (Study 6) data set [16] contains UPS1 proteins spiked into a yeast proteome digest with concentrations of 0.25, 0.74, 2.2, 6.7 and 20 fmol/μl. Three technical replicates of each concentration were analyzed using LTQ Orbitrap mass spectrometer (at test site 86). The LTQ Orbitrap@86 spike-in data are available from the CPTAC-portal (http://cptac-data-portal.georgetown.edu/cptac/dataPublic/list/LTQ-Orbitrap%4086?currentPath=%2FPhase_I_Data%2FStudy6). Sample Group E was left out from our analysis, as it had only two technical replicates because of the Progenesis software being unable to align one of the technical replicates automatically.

The SGSD data set

The profiling standard of Bruderer et al. [17] contains 12 nonhuman proteins spiked into a constant human background (HEK-293). It contains eight different sample groups with known concentrations of the spike-in proteins. Each of the samples contains three replicates, which have been analyzed both in data-dependent acquisition (DDA) and data-independent acquisition modes. We used the DDA shotgun proteomics data (referred to here as shotgun standard set, SGSD) for our comparisons. The profiling standard is available from PeptideAtlas: No. PASS00589 (username PASS00589, password WF6554orn).
Mouse data

The mouse data set contains liver samples of seven wild-type male mice and five transgenic male mice overexpressing cytochrome P450 aromatase [18]. The samples were analyzed with an MS/MS LTQ Orbitrap Velos Pro mass spectrometer coupled to an EASY-nLC liquid chromatography system [18]. The mouse data set is available from the ProteomeXchange with the identifier PXD002025 (http://www.ebi.ac.uk/pride/archive/projects/PXD002025). Further details of the data set are available in the original study [18].

Common data preprocessing

The raw MS files were processed using the Progenesis QI software with the default peak-picking settings. ‘Relative quantitation using non-conflicting peptides’ setting was used, which calculates protein abundance in a run as the sum of all the unique peptide ion abundances corresponding to that protein. Peptide identifications were performed using Mascot search engine via Proteome Discoverer. For the database searches, cysteine carbamidomethylation was set as a fixed modification and methionine oxidation as a dynamic modification. Mascot score corresponding to false discovery rate of 0.01 was set as a threshold for peptide identifications.

The Progenesis software does not produce missing values per se, but produces some zeroes, which can be interpreted as abundance below detection capacity or protein not existing in the sample. The number of zeros in the data sets was small: 0.06–0.6% of the total of all values. As the EigenMS normalization method does not accept zero values, they were transformed into missing values (Not applicable (NA)). The same preprocessing was used with all the methods for comparability.

The exported nonnormalized data from Progenesis were transformed into log2-scale before all other normalizations except for Vsn. The Vsn normalization performs a transformation similar to the log transformation and requires the input data to be untransformed [19].

Data analysis environment

All the data analyses were done using the R-statistical programming language version 3.2.4 [20].

Summary of the normalization methods

Linear regression normalization (Rlr, RlrMA, RlrMACyc)

The linear regression normalization assumes that the bias in the data is linearly dependent on the magnitude of the measured protein intensity [9]. As the measured protein intensity increases, the bias also increases. We explored three variants of the robust linear regression called Rlr, RlrMA and RlrMA cyclic. The Rlr uses the median values over all the samples as its reference sample to which all the other samples in the data are normalized to. The RlrMA is similar, with the exception that the data are MA transformed before normalization, where A refers to the median sample and M is calculated for each sample as the difference of that sample to A. In the RlrMACyc, there is no reference, but instead, the MA transformation and the normalization of the samples are done pairwise between two samples, A being the average of the two samples and M the difference. The process is iterated through all sample pairs similar to the LinRegMA of [10]. The cycle is repeated three times, which has been observed to be enough to reach convergence between iteration cycles for the algorithm [5, 10]. All the variants of the linear regression normalizations were implemented using the robust linear regression of the R-package MASS [21]. The robust linear regression is more robust against outliers in the data than linear regression using least squares estimation. The Rlr normalization was implemented as the robust linear regression normalization of Normalyzer [3].

Local regression normalization (LoessF, LoessCyc)

The local regression normalization assumes a nonlinear relationship between the bias in the data and the magnitude of protein intensity [9]. We explored two common variants of local regression normalization: LoessF and LoessCyc. The data are MA transformed before normalization as with the RlrMA method. LoessF uses the mean intensities over all the samples as its reference A sample. LoessCyc is a cyclic normalization method in which two samples of the data are MA transformed and normalized at a time, and all pairs of samples are iterated through. The cycle is repeated three times similarly to the RlrMACyc method. Both of the Loess normalizations were implemented using the normalizeCyclicLoess-function from R/Bioconductor-package limma [22].

Variance stabilization normalization (Vsn)

The Vsn is a statistical method aiming at making the sample variances nondependent from their mean intensities and bringing the samples onto a same scale with a set of parametric transformations and maximum likelihood estimation [19]. The Vsn method was implemented with the justvsn function from the R/Bioconductor-package vsn [19].

Quantile normalization (quantile)

The quantile normalization forces the distributions of the samples to be the same on the basis of the quantiles of the samples by replacing each point of a sample with the mean of the corresponding quantile [5]. The quantile normalization was performed using the normalize.quantiles function from the R/Bioconductor-package preprocessCore [23].

Median normalization (median)

The median normalization is based on the assumption that the samples of a data set are separated by a constant. It scales the samples so that they have the same median. The median normalization was implemented using the median intensity normalization of Normalyzer [3].

Progenesis normalization (Progenesis)

The Progenesis normalization is the normalization method provided by the Progenesis data analysis software. The Progenesis normalization calculates a global scaling factor between the samples by using a selected reference sample to which the other samples are normalized to. The Progenesis normalization was performed simultaneously with the preprocessing of the data.

EigenMS normalization (EigenMS)

The EigenMS normalization fits an analysis of variance model to the data to evaluate the treatment group effect and then uses singular value decomposition on the model residual matrix to identify and remove the bias [24]. The EigenMS aims at preserving the original differences between treatment groups while removing the bias from the data [25]. The EigenMS normalization was implemented using the R-codes of EigenMS [24] available for download in the Sourceforge-repositories (http://sourceforge.net/projects/eigenms/).

Evaluation of the normalization methods

We evaluated the normalization methods as follows: (1) in their ability to decrease variation between technical replicates, (2) in their ability to produce data from which the truly differentially expressed proteins can be accurately found and (3) in how well the logFCs calculated from the normalized data corresponded to what was expected based on theoretical logFCs. We also evaluated whether normalizing the data globally or pairwise (i.e. based only on the sample groups under comparison) affected the performance of the methods in the differential expression analysis.

Intragroup variation and similarity

The effect of normalization was evaluated quantitatively using intragroup variability measures that measure the variation between technical replicates. Low intragroup variation means high similarity between technical replicates, indicating high reproducibility of the analysis. Intragroup variation was measured with PMAD, PCV and PEV. Additionally, similarity of the technical replicates in sample groups was measured with the Pearson correlation coefficient.

Differential expression analysis

Differential expression of proteins was examined in each two-group comparison using the reproducibility-optimized test statistic (ROTS) [26] or the t-test after application of the different normalization methods in each data set. The results of the differential expression analyses were evaluated with receiver operating characteristic (ROC) curve analysis, where the spike-in proteins were considered as true positives and the background proteins as true negatives. The normalization methods were ranked based on their performance in the differential expression analysis using the area under the ROC curve (AUC) as a ranking criterion. Better ranks were assigned to normalization methods with higher AUC values. In case of ties, the normalization methods received equal ranks. A mean ranking with associated standard error was calculated for each normalization method in each data set. Also, a pooled mean ranking over all the spike-in data sets was calculated for each normalization method. The Satterthwaite approximation was used to calculate the associated standard error for the pooled mean ranking. The normalization methods were ranked independently with each test statistic (ROTS, t-test).

The log fold changes of the spike-in and background proteins

The aim of normalization is to remove the unwanted (nonbiological) variation from the data. In case of the spike-in data sets used in this study, the levels of spike-in proteins should change, while the levels of the background proteins should remain unchanged. We examined the distributions of the logFCs of the spike-in and background proteins in data normalized with the different methods.

Evaluation of the normalization types

To explore if there is a difference in the performance of the normalization methods depending on the way in which the normalization is done, the data were normalized in two ways: globally and pairwise. In global normalization, the whole data containing all the sample groups of a data set were normalized at once. In pairwise normalization, the sample groups being compared in the differential expression analysis were first extracted from the unnormalized data and then normalized separately. Owing to the similarity of the results of the normalization types, only results of the global normalization are presented in the Results section unless where it is explicitly stated otherwise.

Results

We examined the performance of the 11 normalization methods in three independent spike-in data sets as well as in a mouse data set from a study on changes in mouse liver lipid metabolism [18]. In the spike-in data sets, the total intensities between samples and sample groups should be almost equal. However, MS data generally show some variation in the total intensities of samples, and this was also the case in the data sets used in this study (Supplementary Figures S1–S3A). This is especially true for the UPS1 data set (Supplementary Figure S1A). After normalization, the situation is changed, and the total intensity levels of the samples are nearly equal (Supplementary Figures S1–S3). The EigenMS normalization, however, does not level the total intensities of different samples like the other normalization methods do, rather the distribution of total intensities in different samples of the EigenMS-normalized data is identical to that of the log2-transformed data.

Effect of normalization on intragroup variation

Normalization decreased intragroup variation measured as PMAD between technical replicates in all data sets when compared with the unnormalized log2-transformed data (Figure 1A–C). Vsn decreased PMAD significantly more than the other normalization methods in all data sets (Wilcoxon signed rank test P < 0.029 between Vsn and the other normalization methods except EigenMS in the CPTAC data set P = 0.057). Analogous patterns were observed also for the other intragroup variability measures (PCV and PEV) (Supplementary Figure S4A–F). Similarly, intragroup similarity between technical replicates measured with the Pearson correlation coefficient was highest in the Vsn-normalized data in all spike-in data sets (Figure 1D–F) (Wilcoxon test <0.03 with all other methods except EigenMS in the SGSD data set P = 0.059 and LoessF, LoessCyc, Progenesis, quantile and EigenMS in the CPTAC-data P = 0.052–0.266).

Figure 1. The effect of normalization method on intragroup variation between technical replicates. The PMADs of (A) UPS1 data, (B) CPTAC data and (C) SGSD data. The Pearson correlation coefficients of (D) UPS1 data, (E) CPTAC data and (F) SGSD data.

Effect of normalization on differential expression

When detecting differential expression, ROTS has been shown to perform better in proteomics data than the standard t-test [15], and this was the case also in the data sets used in this study (Supplementary Figure S8, Supplementary Tables S1–S2). Normalizing the data improved the AUCs of the differential expression analysis in general (Figure 2A–C, Table 1). However, there was considerable variation in the performance of the different normalization methods in the different data sets tested. The benefits of normalization were most prominent in the UPS1 data set (Figure 2A, Table 1), in which all the other normalization methods were ranked higher than the simple log2 transformation except for the EigenMS and the Quantile normalization. The Vsn-normalized data had the highest AUC in every two-group comparison in the UPS1 data set when using ROTS (Delong’s test P < 0.04 with all the other methods).

In the CPTAC and SGSD data sets, the differences between the normalization methods were smaller on average, but some differences were found. In the CPTAC data set, all the normalization methods, except for the median normalization, ranked on average higher than the log2 transformation when the differential expression was analyzed with ROTS (Figure 2B, Table 1). In most of the two-group comparisons in the CPTAC data, no significant differences in the AUCs produced by the best ranking normalization method and the other methods were observed (Delong’s test P > 0.05), with few exceptions. In the 0.74 versus 2.2 fmol comparison, the Progenesis normalization ranked first and gave a significantly higher AUC than 8 of 10 methods (Delong’s test P < 0.049). In the 2.2 versus 6.7 fmol comparison, the Vsn normalization ranked first and gave a significantly higher AUC than 6 of 10 methods (Delong’s test P < 0.028). In the 0.25 versus 0.74 comparison, the RlrMACyc normalization method ranked best and gave an AUC significantly higher than half of the other methods (Delong’s test P < 0.044 for 5 of 10 methods).

In the SGSD data set, differences between the different normalization methods and the log2 transformation were generally small. Only five normalization methods, the Vsn, RlrMA, Rlr, RlrMACyc and LoessF, ranked on average higher than the log2 transformation in the SGSD data set (Table 1). In most of the two-group comparisons, there was no significant difference between the AUC of the best ranking method and the AUCs of the other methods (Delong’s test P > 0.05), with few exceptions. In the 5 versus 7, 5 versus 8, 6 versus 7, 6 versus 8 and 7 versus 8 comparisons, the Vsn normalization consistently ranked first and gave a higher AUC than most of the other methods tested (Delong’s test P < 0.046 for 6–8 of 10 methods; Figure 2C).

While no single method gave the highest AUC in every two-group comparison, the Vsn normalization performed consistently well, giving high AUCs throughout all data sets. This resulted in the highest pooled mean rank across all data sets and high mean ranks regardless of the test statistic used (Table 1). The linear regression methods relying on an artificial reference (RlrMA and Rlr) and the local regression method using an artificial reference (LoessF) also performed systematically well throughout all the comparisons in all data sets (Figure 2, Table 1). Some of the visuals are overlapping in Figure 2. In particular, LoessF is covered largely by the lines of the other normalization methods: Progenesis normalization in Figure 2A and other methods in Figures 2B and C.

Figure 2. The effect of normalization method on differential expression results. The AUCs of the ROC curves of differential expression analysis in (A) UPS1 data, (B) CPTAC data and (C) SGSD data globally normalized with the different methods. The x axes denote the two-group comparisons of the sample groups.

Table 1. Rankings of the normalization methods based on AUCs of the ROC curves of the differential expression analysis using global normalization. Best mean ranking in each data set and best pooled mean ranking with each test statistic are bolded. The methods were ranked independently when using different test statistics

Normalization method   Statistical test   UPS1         CPTAC        SGSD         Pooled mean
Log2                   ROTS               8 ± 1.02     8.5 ± 0.76   4.6 ± 0.78   5.9 ± 1.49
Log2                   t-test             7.1 ± 1.35   8 ± 1.29     4.5 ± 0.74   5.6 ± 1.87
Loess_fast             ROTS               3.5 ± 0.34   4.7 ± 1.09   4.3 ± 0.51   4.2 ± 1.25
Loess_fast             t-test             1.9 ± 0.55   4.7 ± 0.84   5.4 ± 0.5    4.5 ± 1.06
Loess_cyclic           ROTS               6.9 ± 1.17   2.8 ± 0.75   5.1 ± 0.64   5.2 ± 1.53
Loess_cyclic           t-test             7.9 ± 0.89   4.3 ± 0.95   7 ± 0.58     6.8 ± 1.43
Rlr_scatter            ROTS               6.4 ± 0.5    3.5 ± 0.43   4.1 ± 0.48   4.5 ± 0.82
Rlr_scatter            t-test             6.5 ± 0.78   4.3 ± 1.02   4.2 ± 0.46   4.7 ± 1.37
Rlr_ma                 ROTS               6.3 ± 0.7    3.8 ± 0.48   3.9 ± 0.5    4.4 ± 0.98
Rlr_ma                 t-test             5.7 ± 0.68   4.3 ± 0.99   3.9 ± 0.42   4.4 ± 1.3
Rlr_ma_cyclic          ROTS               7.3 ± 0.54   6.5 ± 1.57   4.3 ± 0.58   5.3 ± 1.75
Rlr_ma_cyclic          t-test             6.3 ± 0.67   4.2 ± 1.45   4.6 ± 0.5    4.9 ± 1.68
Vsn                    ROTS               1 ± 0        4.3 ± 1.41   2.7 ± 0.46   2.5 ± 1.48
Vsn                    t-test             3.9 ± 0.31   3.8 ± 0.87   3.5 ± 0.34   3.6 ± 0.98
Quantile               ROTS               8.2 ± 0.61   7.7 ± 0.56   7 ± 0.74     7.4 ± 1.11
Quantile               t-test             8.8 ± 0.39   8.3 ± 0.67   9.6 ± 0.46   9.2 ± 0.99
Median                 ROTS               5.9 ± 0.75   9.5 ± 0.72   5 ± 0.68     5.8 ± 1.24
Median                 t-test             6.2 ± 0.98   10 ± 0.54    5.3 ± 0.65   6.2 ± 1.16
Progenesis             ROTS               3.3 ± 0.67   6.7 ± 1.41   6.1 ± 0.75   5.5 ± 1.73
Progenesis             t-test             3 ± 0.54     5.3 ± 1.17   7.1 ± 0.59   5.9 ± 1.41
EigenMS                ROTS               9.2 ± 0.76   8 ± 1.32     5.5 ± 0.85   6.7 ± 1.74
EigenMS                t-test             8.6 ± 0.92   8.5 ± 1.18   5.3 ± 0.8    6.5 ± 1.51

Effect of normalization type

In general, whether the data were normalized globally or pairwise between the two groups compared did not have a major effect on the AUCs of the differential expression analysis (Figure 2A–C versus Supplementary Figure S5A–C). The only exceptions were the cyclic normalization methods, LoessCyc and RlrMACyc, which benefitted from normalizing the data pairwise in the UPS1 data set (Figure 2A versus Supplementary Figure S5A). This could also be seen in the MA plots of the UPS1 data, in which the data were centered well in the line M = 0 in the pairwise normalized data of the cyclic methods, but not in the globally normalized data of the same methods (Supplementary File S1).

Effect of normalization on logFC

When looking at the distribution of the logFC of the background proteins in all data sets, we can see that it is centered around zero for all the other normalization methods except for the EigenMS normalization (Figure 3A), for which the distribution was identical to that of the log2 transformation. The distribution of the logFCs in the Vsn-normalized data was more concentrated around zero than in data sets normalized with the other methods, which can be seen as a narrower and higher density distribution for the Vsn-normalized data.

Based on the known concentrations of the spike-in proteins, the logFCs of the spike-in proteins were typically underestimated both in the normalized data as well as in the log2-transformed data (Figure 3B and C, Supplementary File S2). The EigenMS-normalized data gave similar estimates as the log2-transformed data; the Vsn normalization gave generally more conservative estimates than the other normalization methods. All the other normalization methods gave consistently similar estimates for the logFC of the spike-in proteins. In the UPS1 data, the logFC of the spike-in proteins of the normalized data was closer to the theoretical known logFC in general than in the log2-transformed data (Supplementary File S2).

Figure 3. The logFC of the background proteins and representative examples of the logFC of the spike-in proteins. (A) The density distributions of the logFC of the background proteins over all two-group comparisons in all data sets. The vertical dashed line corresponds to logFC of zero. The logFC of the spike-in proteins (upper boxes) and the background proteins (lower boxes) in the (B) 10 versus 25 fmol comparison of the UPS1 data and (C) the 0.74 versus 2.2 fmol comparison of the CPTAC data. The horizontal solid black lines correspond to logFC of zero, while the horizontal dashed lines correspond to the theoretical expected logFC of the spike-in proteins.

Visual quality inspection

The MA plot is a common tool for exploring the bias in the data of two samples [5, 9]. Normalization aims to remove the bias from the data and center the data scatter of the sample pair examined around the x axis (M = 0) in the MA plot. In this study, MA plots were drawn and observed with each normalization method in each two-group comparison of each data set. Based on visual inspection of these plots, the Vsn normalization seems to concentrate the data more tightly both around the x axis and to a narrower scale of transformed intensities than the logarithm transformation and the other normalization methods in general (Figure 4, Supplementary File S1). In the CPTAC and the SGSD data sets, the data in the two-group comparisons were well centered already after the logarithm transformation. In the UPS1 data, the data after the cyclic normalizations (RlrMACyc and LoessCyc) were much more centered after pairwise normalization than after global normalization (Supplementary File S1). In many two-group comparisons, the quantile normalization seemed to introduce extra patterns into the data on high intensities not seen in the unnormalized log2-transformed data (Supplementary File S1).

Figure 4. Representative MA plots of the two-group comparisons after normalization with the most successful normalization method and log2 transformation in each data set. MA plots of the (A) 2 versus 10 fmol comparison of the UPS1 data, (B) 0.25 versus 2.2 fmol comparison of the CPTAC data and (C) sample 1 versus sample 4 comparison of the SGSD data normalized with the Vsn normalization. MA plots of the (D) 2 versus 10 fmol comparison of the UPS1 data, (E) 0.25 versus 2.2 fmol comparison of the CPTAC data and (F) sample 1 versus sample 4 comparison of the SGSD data after the log2 transformation. The lighter nonblack points in the plots correspond to the spike-in proteins and the black points to the background proteins. The curve corresponds to a loess smoothing function.

Testing on mouse data

In addition to the three spike-in data sets, we also compared the performance of the normalization methods in a mouse study data set, which represents a typical real study setting [18]. When looking at the levels of total intensities of the samples in the log2-transformed mouse data, we can see that they are unequal (Supplementary Figure S6A). When applying normalization, most of the methods equalize the levels of total intensities of different samples, except for the EigenMS (Supplementary Figure S6B–K).

In the mouse data set, we investigated biological replicates of the same treatment group instead of technical replicates. Similar patterns for intragroup variation for data normalized with the different methods were observed as with the spike-in data sets (Figure 5). All normalization methods decreased intragroup variation when measured with the PMAD compared with the unnormalized data. PMAD was smallest in the Vsn- and EigenMS-normalized data, but the differences to the other methods were not significant (Wilcoxon signed rank test >0.33; Figure 5A). Similar patterns were observed with the other intragroup measures PCV and PEV (Supplementary Figure S7). Intragroup similarity measured with the Pearson correlation coefficient was highest among the EigenMS-normalized data, but the differences to the other methods were small (Wilcoxon P > 0.18; Figure 5B). The mouse data did not contain any spike-in proteins and thus we did not have prior knowledge about expected protein changes. Therefore, differential expression analysis was not directly applicable to assess the performance of the normalization methods. The same was true for the logFC.

Figure 5. Intragroup variation between biological replicates in the mouse data normalized with the different methods. (A) The PMADs and (B) the Pearson correlation coefficients.

... might be different in different data sets. Therefore, the used normalization should not make too rigid assumptions about the nature of the bias, unless we know or can estimate the bias and purposefully want to use a method targeting specifically that kind of bias. The Vsn, quantile and the EigenMS normalizations do not make strict assumptions about the nature of the bias and are general methods in that sense.
The median and quantile normalizations were on par with Discussion most of the normalization methods in reducing intragroup vari- In the spike-in data sets examined in this study, the Vsn nor- ation, but they did not rank well in terms of differential expres- malization consistently reduced intragroup variation the most, sion analysis. It is notable, however, that even though not increased intragroup similarity the most and gave consistently having a high ranking, both methods performed consistently in high AUCs in the differential expression analysis, resulting in the differential expression analysis by not producing low AUCs the highest pooled mean ranking among the normalization in any of the two-group comparisons like the log2 transform- methods tested. The EigenMS normalization also consistently ation did in the UPS1 data set (Figure 2A). More worrying is the reduced intragroup variation more than the other methods tendency of the quantile normalization to introduce extra pat- examined, but it did not perform well in the differential expres- terns into the data on high intensities seen on many two-group sion analysis. Also, other normalization methods decreased comparisons (Supplementary File S1). The Progenesis normal- intragroup variation when compared with the unnormalized ization had the second highest ranking in the differential ex- log2-transformed data, but no major differences between them pression analysis in the UPS1 data, but ranked worse in the two were observed. In previous comparisons of normalization meth- other data sets examined (Table 1). The EigenMS behaved ods in proteomics/peptidomics focusing on intragroup variation differently from the other normalization methods examined in measures, the Vsn normalization has been ranked average [10] this study. While it was effective in reducing intragroup vari- or as among the most suitable methods [3]. 
Previous studies ation, it did not perform so well in the differential expression have suggested the linear regression normalization or its vari- analysis. Instead, it performed similarly as the simple log2 ants or local regression normalization to reduce intragroup vari- transformation. ation the most [3, 9, 10]. We observed the linear regression An arbitrary but commonly used cutoff value to deter- normalization variants and the local regression normalization mine differentially expressed genes and proteins is a logFC variants performing on par with the other normalization meth- of one [27–29], which corresponds to a 2-fold change in ex- ods in reducing intragroup variation, with no major differences. pression. As we noticed from the logFC plots of the data nor- However, even though we did not observe the linear and local malized with the different methods (Figure 3B and C, regression to reduce intragroup variation more than the other Supplementary File S2), the estimates for the known differ- normalization methods, we noticed that the local regression entially expressed proteins frequently remained under this method using a mean reference sample, LoessF, consistently limit even if the differentially expressed proteins were de- produced high AUCs in the differential expression analysis. The tected with great accuracy. This was especially true for the same was true for the linear regression methods using a median Vsn-normalized data, which gave conservative estimates for reference sample, Rlr and RlrMA. The local regression normal- the logFC of the spike-in proteins, but from which the spike- ization fared better in the UPS1 data set, while the linear regres- in proteins were detected with great accuracy. 
This warrants sion normalization performed better in the CPTAC and SGSD caution for the use of any such generic cutoff values for fil- data sets, perhaps indicating a different kind of bias in the data tering the differentially expressed proteins based on their sets. Typically, the variants using a reference sample performed logFC. better than their cyclic counterparts, with the exception of the Although Vsn performed generally well in our comparisons, cyclic loess normalization LoessCyc in the CPTAC data. the fact that it consistently underestimated the logFCs of the It became clear that the spike-in data sets in this analysis dif- spike-in proteins can be seen as a potential drawback of the fered from each other. The sample groups of the UPS1 data set method if the researcher would be interested particularly in had much larger variation in the total intensities than the other examining the logFCs of proteins. For this particular task, some two data sets, especially the SGSD data set, which had many of the other well-performing normalization methods (LoessF, sample groups with roughly similar levels of total intensities. Rlr, RlrMA) would be perhaps more suitable. Also, all of the nor- This could be because of a number of reasons, such as different malization methods studied here, excluding EigenMS, assume instrumentation or protocols/methods used, but is interesting that only a small portion of the proteins are differentially ex- from the point of normalization. The total intensities between pressed between samples and force the total intensity levels of the samples may vary from data to data also in the case of real the samples to be on the same level (Supplementary Figure S1). experimental study settings, and we would like to find a normal- This might be problematic if in fact a large number of proteins ization method that can perform as consistently as possible no are differentially expressed between samples. 
In such cases, matter the characteristics of the data. Notably, normalization methods like the EigenMS might be more suitable for normaliz- clearly improved the AUCs also in the CPTAC data set when ing the data. We encourage the researcher to reflect on what is compared with the unnormalized log2-transformed data (Table known beforehand about the task at hand and select the appro- 1), regardless of the fact that it had rather equal total intensity priate normalization method accordingly. levels before normalization. This emphasizes the importance of All of the normalizations in this study were performed on a consistent normalization method; even if we have a high- protein-level data. Normalization can be performed also at the quality data set with rather equal unnormalized sample levels, peptide level. The next step would be to perform a similar ex- we cannot necessarily deduce whether a simple logarithmic haustive comparison of the normalization methods on peptide transformation would suffice in delivering the truly differen- level and explore if the same methods fare well with peptide tially expressed proteins reliably. Also, the nature of the bias data. Also, the choice of peptides to be used for Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 10 | Valikangas et al. 5. Bolstad BM, Irizarry RA, Astrand M, et al. A comparison of nor- the normalization has been demonstrated to have an effect [11], and exploring this idea in conjunction with the normalizations malization methods for high density Oligonucleotide array used in this study would be an interesting further topic. data based on variance and bias. Bioinformatics 2003;19:185–93. Based on the comparisons made in this study, normalization 6. Choe SE, Boutros M, Michelson AM, et al. 
Preferred analysis decreased intragroup variation in general and resulted in better methods for Affymetrix GeneChips revealed by a wholly AUCs in the differential expression analysis than the simple defined control dataset. Genome Biol 2005;6:R16. 7. Yang YH, Dudoit S, Luu P, et al. Normalization for cDNA log2 transformation in case of most of the normalization meth- ods examined. The Vsn normalization performed consistently microarray data. In: ML Bittner, YD Chen, AN Dorsel, ER well in reducing intragroup variation and in the differential ex- Dougherty (eds). Microarrays: Optical Technologies and pression analysis in all tested data sets. The local regression Informatics, Vol. 10. San Jose: SPIE, Society for Optical Engineering, 2001, 141–52. and linear regression normalizations using a reference also reduced intragroup variation compared with the unnormalized 8. Schadt EE, Li C, Ellis B, et al. Feature extraction and data and consistently delivered good AUCs in the differential normalization algorithms for high-density Oligon expression analysis. ucleotide gene expression array data. J Cell Biochem 2001;125:120–5. 9. Callister SJ, Barry RC, Adkins JN, et al. Normalization Key Points approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res Data generated by the MS analysis are prone to biases, 2006;5:277–86. which can be accounted for with normalization result- 10. Kultima K, Nilsson A, Scholz B, et al. Development and evalu- ing in more reliable downstream analysis. ation of normalization methods for label-free relative quanti- In total, 11 normalization methods were systematic- fication of endogenous peptides. Mol Cell Proteomics ally evaluated in this study using three spike-in and a 2009;8:2285–95. mouse label-free proteomics data sets. 11. Webb-Robertson B-JM, Matzke MM, Jacobs JM, et al. 
A statis- Vsn reduced variation the most between the technical tical selection strategy for normalization procedures in LC- replicates in all studied data sets and consistently per- MS proteomics experiments through dataset dependent formed well in the differential expression analysis. ranking of normalization scaling factors. Proteomics The local regression normalization using an artificial 2011;11:4736–41. reference sample (LoessF) and linear regression nor- 12. Chawade A, Sandin M, Teleman J, et al. Data processing has malization using artificial reference samples (Rlr and major impact on the outcome of quantitative label-free LC- RlrMA) also performed systematically well in the dif- MS analysis. J Proteome Res 2015;14:676–87. ferential expression analysis. 13. Cox J, Hein MY, Luber CA, et al. Accurate proteome-wide The nature and extent of the bias in the data are not label-free quantification by delayed normalization and max- generally known beforehand; the application of a con- imal peptide ratio extraction, termed MaxLFQ. Mol Cell sistent normalization method is crucial for reliable Proteomics 2014;13:2513–26. results. 14. Zhang B, Kall L, Zubarev RA. DeMix-Q: quantification- centered data processing workflow. Mol Cell Proteomics 2016;15:1467–78. 15. Pursiheimo A, Vehmas AP, Afzal S, et al. Optimization of stat- Supplementary Data istical methods impact on quantitative proteomics data. J Proteome Res 2015;14:4118–26. Supplementary data are available online at http://bib.oxford 16. Tabb DDL, Vega-Montoto L, Rudnick PA, et al. Repeatability journals.org/. and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry. J Proteome Res Funding 2010;9:761–76. 17. Bruderer R, Bernhardt OM, Gandhi T, et al. 
Extending the This study was supported by the Sigrid Juselius Foundation, limits of quantitative proteome profiling with data- JDRF [grant number 2-2013-32] and the European Research independent acquisition and application to acetaminophen Council (ERC) Starting Grant no. 677943. treated 3D liver microtissues. Mol Cell Proteomics 2015;14:1400–10. References 18. Vehmas AP, Adam M, Laajala TD, et al. Liver lipid metabolism is altered by increased circulating estrogen to androgen ratio 1. Megger DA, Bracht T, Meyer HE, et al. Label-free quantification in male mouse. J Proteomics 2016;133:66–75. in clinical proteomics. Biochim Biophys Acta 2013;1834: 19. Huber W, von Heydebreck A, Su ¨ ltmann H, et al. Variance sta- 1581–90. bilization applied to microarray data calibration and to the 2. Meissner F, Mann M. Quantitative shotgun proteomics: con- quantification of differential expression. Bioinformatics siderations for a high-quality workflow in immunology. Nat 2002;18(Suppl 1):S96–104. Immunol 2014;15:112–7. 20. R Core Team. R: A Language and Environment for Statistical 3. Chawade A, Alexandersson E, Levander F. Normalyzer: a tool Computing. R Foundation for Statistical Computing, Vienna, for rapid evaluation of normalization methods for Omics Austria, 2015. URL https://www.R-project.org/. data sets. J Proteome Res 2014;13:3114–20. 21. Venables WN, Ripley BD. Modern Applied Statistics with S. 4. Karpievitch YV, Dabney AR, Smith RD. Normalization and Fourth Edition. Springer, New York, 2002. ISBN 0-387- missing value imputation for label-free LC-MS analysis. BMC 95457-0. Bioinformatics 2012;13:S5. Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 Normalization in Quantitative label-free proteomics | 11 26. Elo L, File ´ n S, Lahesmaa R, et al. Reproducibility-optimized 22. Ritchie ME, Phipson B, Wu D, et al. 
limma powers differential expression analyses for RNA-sequencing and microarray test statistic for ranking genes in microarray studies. IEEE/ studies. Nucleic Acids Res 2015;43:e47. ACM Trans Comput Biol Bioinform 2008;5:423–31. 27. Quackenbush J. Microarray data normalization and trans- 23. Bolstad BM. preprocessCore: A collection of pre-processing formation. Nat Genet 2002;32:496–501. functions. R package version 1.32.0, https://github.com/ 28. DeRisi J, Penland L, Brown PO, et al. Use of a cDNA microarray bmbolstad/preprocessCore. to analyse gene expression patterns in human cancer. Nat 24. Karpievitch YV, Taverner T, Adkins JN, et al. Normalization Genet 1996;14:457–60. of peak intensities in bottom-up MS-based proteomics using singular value decomposition. Bioinformatics 2009;25: 29. Cho SH, Goodlett D, Franzblau S. ICAT-based comparative prote- omic analysis of non-replicating persistent Mycobacterium tu- 2573–80. berculosis. Tuberculosis 2006;86:445–60. 25. Karpievitch YV, Nikolic SB, Wilson R, et al. Metabolomics data normalization with EigenMS. PLoS One 2014;9:e116221. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Briefings in Bioinformatics Oxford University Press
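As a rough illustration of two quantities used throughout the Results and Discussion, the following Python sketch computes the M and A values behind an MA plot and compares an observed logFC with the commonly used cutoff of one. It is only a minimal sketch under stated assumptions: the function names and the intensity values are hypothetical and are not taken from the study, whose analyses were done in R. Only the 10 versus 25 fmol spike-in concentrations come from the UPS1 comparison described in the article.

```python
import math


def ma_values(sample_x, sample_y):
    """Return (M, A) for two samples of raw intensities.

    M is the difference and A the average of the log2 intensities,
    so a well-centered sample pair scatters around M = 0.
    """
    logs = [(math.log2(x), math.log2(y)) for x, y in zip(sample_x, sample_y)]
    m = [lx - ly for lx, ly in logs]
    a = [(lx + ly) / 2 for lx, ly in logs]
    return m, a


def expected_log_fc(conc_a, conc_b):
    """Theoretical log2 fold change from known spike-in concentrations."""
    return math.log2(conc_b / conc_a)


def observed_log_fc(group_a, group_b):
    """Difference of the mean log2 intensities of two replicate groups."""
    mean_a = sum(math.log2(v) for v in group_a) / len(group_a)
    mean_b = sum(math.log2(v) for v in group_b) / len(group_b)
    return mean_b - mean_a


# Known concentrations from the 10 versus 25 fmol UPS1 comparison.
theoretical = expected_log_fc(10, 25)

# Hypothetical replicate intensities of one spike-in protein (assumed
# values), chosen so that the observed ratio is compressed to 1.8-fold.
group_10 = [1000.0, 1000.0, 1000.0]
group_25 = [1800.0, 1800.0, 1800.0]
observed = observed_log_fc(group_10, group_25)

# A generic |logFC| >= 1 filter would drop this protein even though it
# truly differs between the groups.
passes_cutoff = abs(observed) >= 1.0
```

With these assumed numbers the theoretical logFC is above one while the observed estimate is below it, which mirrors the caution in the Discussion about filtering on a generic logFC cutoff when the logFCs are underestimated.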

A systematic evaluation of normalization methods in quantitative label-free proteomics


References (33)

Publisher
Oxford University Press
Copyright
Copyright © 2022 Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
DOI
10.1093/bib/bbw095
pmid
27694351

Abstract

To date, mass spectrometry (MS) data remain inherently biased as a result of reasons ranging from sample handling to differences caused by the instrumentation. Normalization is the process that aims to account for the bias and make samples more comparable. The selection of a proper normalization method is a pivotal task for the reliability of the downstream analysis and results. Many normalization methods commonly used in proteomics have been adapted from the DNA microarray techniques. Previous studies comparing normalization methods in proteomics have focused mainly on intragroup variation. In this study, several popular and widely used normalization methods representing different strategies in normalization are evaluated using three spike-in and one experimental mouse label-free proteomic data sets. The normalization methods are evaluated in terms of their ability to reduce variation between technical replicates, their effect on differential expression analysis and their effect on the estimation of logarithmic fold changes. Additionally, we examined whether normalizing the whole data globally or in segments for the differential expression analysis has an effect on the performance of the normalization methods. We found that variance stabilization normalization (Vsn) reduced variation the most between technical replicates in all examined data sets. Vsn also performed consistently well in the differential expression analysis. Linear regression normalization and local regression normalization performed also systematically well. Finally, we discuss the choice of a normalization method and some qualities of a suitable normalization method in the light of the results of our evaluation. 
Key words: proteomics; normalization; label free; bias; differential expression; logarithmic fold change; quantitation; intragroup variation; reproducibility; mass spectrometry

Tommi Valikangas is a Research Scientist in the Computational Biomedicine Group at the Turku Centre for Biotechnology Finland. He is interested in computational biology and bioinformatics.

Tomi Suomi is a Research Scientist in the Computational Biomedicine research group at the Turku Centre for Biotechnology Finland. His research interests include scientific computing and bioinformatics.

Laura L. Elo is Adjunct Professor in Biomathematics, Research Director in Bioinformatics and Group Leader in Computational Biomedicine at Turku Centre for Biotechnology, University of Turku, Finland. Her main research interests include computational biomedicine and bioinformatics.

Submitted: 9 June 2016; Received (in revised form): 6 September 2016

© The Author 2016. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The development of mass spectrometry (MS)-based proteomics has been rapid. Modern proteomics aims not only to identify the proteins but also to quantify them as accurately as possible [1]. Current MS-based proteomics workflows are able to detect thousands of proteins, their modifications and localizations in a single run [2]. Despite all the developments of MS technologies, the data from the MS analysis are still susceptible to systematic biases [3]. This bias has been defined as variation caused by nonbiological sources, which is introduced by small variations in the experimental conditions in the course of carrying out the MS analysis [4]. These variations include, for example, differences in sample preparation and handling, device calibration or changes in temperature, but the exact reason of the bias is usually unknown and cannot thus be solely accounted for by adjusting the experimental settings [3, 4]. The observed bias can be independent or dependent on the measured protein abundances [4].

The process that aims to take the bias into account is called normalization. Normalization aims to make the samples of the data more comparable and the following downstream analysis reliable [3]. Many of the normalization methods used for proteomics data have their roots in the DNA microarray technology [4], where several evaluations and reviews have already elucidated their performance [5–8]. For instance, Bolstad et al. [5] compared five normalization methods with DNA microarray data and concluded that most of them performed rather similarly and reduced nonbiological variability across arrays when compared with the unnormalized data. Choe et al. [6] also found no significant differences between the four normalization methods they examined with an RNA spike-in experiment at the probe level. In previous comparisons in proteomics, Callister et al. [9] used three different liquid chromatography–MS (LC-MS) data sets to evaluate four different normalization methods on peptide level and found a linear regression normalization best suited for their data sets. Kultima et al. [10] compared 10 different normalization methods with three different peptidomics data sets and noticed that the order of the LC-MS experiments affected the bias in the data; they suggested that their novel RegrRun normalization, which combines linear regression normalization with analysis order normalization, was the best overall method in reducing unwanted intragroup and intrasample variation.

Different tools for helping in the selection of a normalization method have also been proposed. Webb-Robertson et al. [11] stated that a single method cannot account for the bias in different data sets; rather it is crucial for reliable downstream analysis to select the appropriate normalization method for each data set. They introduced a tool called SPANS, which combines eight methods for peptide selection to be used in normalization with five normalization methods [11]. Chawade et al. [3] also introduced a tool for choosing a proper normalization method called Normalyzer. Their tool includes several popular normalization methods such as linear regression, local regression, total intensity, average intensity, median intensity, variance stabilization normalization (Vsn) and quantile normalization, together with several frequently used evaluation measures used to assess the performance of a normalization method such as the pooled coefficient of variation (PCV), the pooled median absolute deviation (PMAD) and the pooled estimate of variance (PEV) [3].

So far, comparisons of normalization methods in proteomics have typically focused on their ability to decrease intragroup variation between technical and/or biological replicates of the test data. Measures for the intragroup variation such as PEV [3, 9, 10], PCV [3], PMAD [3], the median coefficient of variation (CV) [9] and the median SD [10] have been used to rank the normalization methods compared. While reducing intragroup variation is certainly a central goal of normalization, a more thorough comparison of the normalization methods and their performance in proteomics is still lacking. Although interesting questions such as differences in the correct detection of truly differentially expressed proteins in the data normalized by different normalization methods have been investigated before [3, 12, 13], a thorough systematic analysis using multiple data sets and two-group comparisons has not been available in proteomics. Also, the effect of the normalization method on the estimation of the logarithmic fold change (logFC) or the effect of how the normalization is performed when comparing only two sample groups from a larger data set has not been systematically investigated before.

To address this need, we conducted an extensive comparison of 11 popular normalization methods or their variants. Other normalization approaches not covered in this study exist, such as the MaxLFQ integrated into the MaxQuant software [13] and the normalization integrated into the DeMix-Q software [14]. These normalizations, however, are integral parts of proteomics software workflows as opposed to the stand-alone normalization methods examined in this comparison, with the exception of Progenesis normalization. All the normalization methods examined are commonly used methods in proteomics and have different approaches and assumptions regarding the bias occurring in the data. Three spike-in label-free proteomics data sets were used for benchmarking the normalization methods. The spike-in data sets are suitable for this kind of method testing, as the differences between sample groups are known, and methods can be evaluated in their ability to find the true differences and to level out other biologically nonexisting differences. Additionally, a data set from a mouse study was also used to compare the performance of the normalization methods in a non-spike-in data set, representing a typical real research setting. Offline fractionation, which adds another layer of complexity to normalization, was not used in any of the tested data sets. In such cases, the total peptide ion signals of each fraction are spread over several runs, which should be normalized before summing up the values [12].

Materials and methods

Description of the data sets

The UPS1 data set

Benchmarking data of Pursiheimo et al. [15] include Universal Proteomics Standard Set (UPS1) proteins spiked into a yeast proteome digest to create concentrations of 2, 4, 10, 25 and 50 fmol/µl. Three technical replicates of each concentration were analyzed using LTQ Orbitrap Velos mass spectrometer. The spike-in data are available from the PRIDE Archive with the identifier PXD002099 (http://www.ebi.ac.uk/pride/archive/projects/PXD002099).

The CPTAC data set

The CPTAC (Study 6) data set [16] contains UPS1 proteins spiked into a yeast proteome digest with concentrations of 0.25, 0.74, 2.2, 6.7 and 20 fmol/µl. Three technical replicates of each concentration were analyzed using LTQ Orbitrap mass spectrometer (at test site 86). The LTQ Orbitrap@86 spike-in data are available from the CPTAC-portal (http://cptac-data-portal.georgetown.edu/cptac/dataPublic/list/LTQ-Orbitrap%4086?currentPath=%2FPhase_I_Data%2FStudy6). Sample Group E was left out from our analysis, as it had only two technical replicates because of the Progenesis software being unable to align one of the technical replicates automatically.

The SGSD data set

The profiling standard of Bruderer et al. [17] contains 12 nonhuman proteins spiked into a constant human background (HEK-293). It contains eight different sample groups with known concentrations of the spike-in proteins. Each of the samples contains three replicates, which have been analyzed both in data-dependent acquisition (DDA) and data-independent acquisition modes. We used the DDA shotgun proteomics data (referred to here as shotgun standard set, SGSD) for our comparisons. The profiling standard is available from PeptideAtlas: No. PASS00589 (username PASS00589, password WF6554orn).

Mouse data

The mouse data set contains liver samples of seven wild-type male mice and five transgenic male mice overexpressing cytochrome P450 aromatase [18]. The samples were analyzed with an MS/MS LTQ Orbitrap Velos Pro mass spectrometer coupled to an EASY-nLC liquid chromatography system [18]. The mouse data set is available from the ProteomeXchange with the identifier PXD002025 (http://www.ebi.ac.uk/pride/archive/projects/PXD002025). Further details of the data set are available in the original study [18].

Common data preprocessing

The raw MS files were processed using the Progenesis QI software with the default peak-picking settings. ‘Relative quantitation using non-conflicting peptides’ setting was used, which calculates protein abundance in a run as the sum of all the unique peptide ion abundances corresponding to that protein. Peptide identifications were performed using Mascot search engine via Proteome Discoverer. For the database searches, cysteine carbamidomethylation was set as a fixed modification and methionine oxidation as a dynamic modification. Mascot score corresponding to false discovery rate of 0.01 was set as a threshold for peptide identifications.

The Progenesis software does not produce missing values per se, but produces some zeroes, which can be interpreted as abundance below detection capacity or protein not existing in the sample. The number of zeros in the data sets was small: 0.06–0.6% of the total of all values. As the EigenMS normalization method does not accept zero values, they were transformed into missing values (Not applicable (NA)). The same preprocessing was used with all the methods for comparability.

The exported nonnormalized data from Progenesis were transformed into log2-scale before all other normalizations except for Vsn. The Vsn normalization performs a transformation similar to the log transformation and requires the input data to be untransformed [19].

Data analysis environment

All the data analyses were done using the R-statistical programming language version 3.2.4 [20].

Summary of the normalization methods

Linear regression normalization (Rlr, RlrMA, RlrMACyc)

The linear regression normalization assumes that the bias in the data is linearly dependent on the magnitude of the measured protein intensity [9]. As the measured protein intensity increases, the bias also increases. We explored three variants of the robust linear regression called Rlr, RlrMA and RlrMA cyclic. The Rlr uses the median values over all the samples as its reference sample to which all the other samples in the data are normalized to. The RlrMA is similar, with the exception that the data are MA transformed before normalization, where A refers to the median sample and M is calculated for each sample as the difference of that sample to A. In the RlrMACyc, there is no reference, but instead, the MA transformation and the normalization of the samples are done pairwise between two samples, A being the average of the two samples and M the difference. The process is iterated through all sample pairs similar to the LinRegMA of [10]. The cycle is repeated three times, which has been observed to be enough to reach convergence between iteration cycles for the algorithm [5, 10]. All the variants of the linear regression normalizations were implemented using the robust linear regression of the R-package MASS [21]. The robust linear regression is more robust against outliers in the data than linear regression using least squares estimation. The Rlr normalization was implemented as the robust linear regression normalization of Normalyzer [3].

Local regression normalization (LoessF, LoessCyc)

The local regression normalization assumes a nonlinear relationship between the bias in the data and the magnitude of protein intensity [9]. We explored two common variants of local regression normalization: LoessF and LoessCyc. The data are MA transformed before normalization as with the RlrMA method. LoessF uses the mean intensities over all the samples as its reference A sample. LoessCyc is a cyclic normalization method in which two samples of the data are MA transformed and normalized at a time, and all pairs of samples are iterated through. The cycle is repeated three times similarly to the RlrMACyc method. Both of the Loess normalizations were implemented using the normalizeCyclicLoess-function from R/Bioconductor-package limma [22].

Variance stabilization normalization (Vsn)

The Vsn is a statistical method aiming at making the sample variances nondependent from their mean intensities and bringing the samples onto a same scale with a set of parametric transformations and maximum likelihood estimation [19]. The Vsn method was implemented with the justvsn function from the R/Bioconductor-package Vsn [19].

Quantile normalization (quantile)

The quantile normalization forces the distributions of the samples to be the same on the basis of the quantiles of the samples by replacing each point of a sample with the mean of the corresponding quantile [5]. The quantile normalization was performed using the normalize.quantiles function from the R/Bioconductor-package preprocessCore [23].

Median normalization (median)

The median normalization is based on the assumption that the samples of a data set are separated by a constant. It scales the samples so that they have the same median. The median normalization was implemented using the median intensity normalization of Normalyzer [3].

Progenesis normalization (Progenesis)

The Progenesis normalization is the normalization method provided by the Progenesis data analysis software. The Progenesis normalization calculates a global scaling factor between the samples by using a selected reference sample to which the other samples are normalized to. The Progenesis normalization was performed simultaneously with the preprocessing of the data.

EigenMS normalization (EigenMS)

The EigenMS normalization fits an analysis of variance model to the data to evaluate the treatment group effect and then uses singular value decomposition on the model residual matrix to identify and remove the bias [24]. The EigenMS aims at preserving the original differences between treatment groups while removing the bias from the data [25]. The EigenMS normalization was implemented using the R-codes of EigenMS [24] available for download in the Sourceforge-repositories (http://sourceforge.net/projects/eigenms/).

Evaluation of the normalization methods

We evaluated the normalization methods as follows: (1) in their ability to decrease variation between technical replicates, (2) in their ability to produce data from which the truly differentially expressed proteins can be accurately found and (3) in how well the logFCs calculated from the normalized data corresponded to what was expected based on theoretical logFCs. We also evaluated whether normalizing the data globally or pairwise (i.e. based only

presented in the Results section unless where it is explicitly stated otherwise.

Results

We examined the performance of the 11 normalization methods in three independent spike-in data sets as well as in a mouse data set from a study on changes in mouse liver lipid metabolism [18].
In the spike-in data sets, the total intensities on the sample groups under comparison) affected the perform- between samples and sample groups should be almost equal. ance of the methods in the differential expression analysis. However, MS data generally show some variation in the total intensities of samples, and this was also the case in the data Intragroup variation and similarity sets used in this study (Supplementary Figures S1–S3A). This is The effect of normalization was evaluated quantitatively using especially true for the UPS1 data set (Supplementary Figure intragroup variability measures that measure the variation be- S1A). After normalization, the situation is changed, and the tween technical replicates. Low intragroup variation means total intensity levels of the samples are nearly equal high similarity between technical replicates, indicating high re- (Supplementary Figures S1–S3). The EigenMS normalization, producibility of the analysis. Intragroup variation was measured however, does not level the total intensities of different samples with PMAD, PCV and PEV. Additionally, similarity of the tech- like the other normalization methods do, rather the distribution nical replicates in sample groups was measured with the of total intensities in different samples of the EigenMS- Pearson correlation coefficient. normalized data is identical to that of the log2-transformed data. Differential expression analysis Differential expression of proteins was examined in each two- group comparison using the reproducibility-optimized test stat- Effect of normalization on intragroup variation istic (ROTS) [26] or the t-test after application of the different Normalization decreased intragroup variation measured as PMAD normalization methods in each data set. The results of the dif- between technical replicates in all data sets when compared ferential expression analyses were evaluated with receiver with the unnormalized log2-transformed data (Figure 1A–C). 
operating characteristic (ROC) curve analysis, where the spike- Vsn decreased PMAD significantly more than the other normaliza- in proteins were considered as true positives and the back- tion methods in all data sets (Wilcoxon signed rank test P < 0.029 ground proteins as true negatives. The normalization methods between Vsn and the other normalization methods except were ranked based on their performance in the differential ex- EigenMS in the CPTAC data set P ¼ 0.057). Analogous patterns pression analysis using the area under the ROC curve (AUC) as a were observed also for the other intragroup variability measures ranking criterion. Better ranks were assigned to normalization (PCV and PEV) (Supplementary Figure S4A–F). Similarly, intragroup methods with higher AUC values. In case of ties, the normaliza- similarity between technical replicates measured with the tion methods received equal ranks. A mean ranking with asso- Pearson correlation coefficient was highest in the Vsn-normalized ciated standard error was calculated for each normalization data in all spike-in data sets (Figure 1D–F) (Wilcoxon test <0.03 method in each data set. Also, a pooled mean ranking over all with all other methods except EigenMS in the SGSD data set P ¼ 0. the spike-in data sets was calculated for each normalization 059 and LoessF, LoessCyc, Progenesis, quantile and EigenMS in the method. The Satterthwaite approximation was used to calculate CPTAC-data P ¼ 0.052–0.266). the associated standard error for the pooled mean ranking. The normalization methods were ranked independently with each test statistic (ROTS, t-test). 
Effect of normalization on differential expression When detecting differential expression, ROTS has been shown The log fold changes of the spike-in and background proteins to perform better in proteomics data than the standard t-test The aim of normalization is to remove the unwanted (nonbiolo- [15], and this was the case also in the data sets used in this gical) variation from the data. In case of the spike-in data sets study (Supplementary Figure S8, Supplementary Tables S1–S2). used in this study, the levels of spike-in proteins should change, Normalizing the data improved the AUCs of the differential ex- while the levels of the background proteins should remain un- pression analysis in general (Figure 2A–C, Table1). However, changed. We examined the distributions of the logFCs of the there was considerable variation in the performance of the dif- spike-in and background proteins in data normalized with the ferent normalization methods in the different data sets tested. different methods. The benefits of normalization were most prominent in the UPS1 data set (Figure 2A, Table 1), in which all the other normal- Evaluation of the normalization types ization methods were ranked higher than the simple log2 trans- To explore if there is a difference in the performance of the nor- formation except for the EigenMS and the Quantile malization methods depending on the way in which the nor- normalization. The Vsn-normalized data had the highest AUC malization is done, the data were normalized in two ways: in every two-group comparison in the UPS1 data set when using globally and pairwise. In global normalization, the whole data ROTS (Delong’s test P < 0.04 with all the other methods). containing all the sample groups of a data set were normalized In the CPTAC and SGSD data sets, the differences between at once. 
In pairwise normalization, the sample groups being the normalization methods were smaller on average, but some compared in the differential expression analysis were first differences were found. In the CPTAC data set, all the normal- extracted from the unnormalized data and then normalized ization methods, except for the median normalization, ranked separately. Owing to the similarity of the results of the normal- on average higher than the log2 transformation when the differ- ization types, only results of the global normalization are ential expression was analyzed with ROTS (Figure 2B, Table 1). Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 Normalization in Quantitative label-free proteomics | 5 Figure 1. The effect of normalization method on intragroup variation between technical replicates. The PMADs of (A) UPS1 data, (B) CPTAC data and (C) SGSD data. The Pearson correlation coefficients of (D) UPS1 data, (E) CPTAC data and (F) SGSD data. In most of the two-group comparisons in the CPTAC data, no high mean ranks regardless of the test statistic used (Table 1). significant differences in the AUCs produced by the best ranking The linear regression methods relying on an artificial reference normalization method and the other methods were observed (RlrMA and Rlr) and the local regression method using an artifi- cial reference (LoessF) also performed systematically well (Delong’s test P > 0.05), with few exceptions. In the 0.74 versus 2.2 fmol comparison, the Progenesis normalization ranked first throughout all the comparisons in all data sets (Figure 2, Table and gave a significantly higher AUC than 8 of 10 methods 1). Some of the visuals are overlapping in Figure 2. In particular, (Delong’s test P < 0.049). 
In the 2.2 versus 6.7 fmol comparison, LoessF is covered largely by the lines of the other normalization the Vsn normalization ranked first and gave a significantly methods: Progenesis normalization in Figure 2A and other higher AUC than 6 of 10 methods (Delong’s test P < 0.028). In the methods in Figures 2B and C. 0.25 versus 0.74 comparison, the RlrMACyc normalization method ranked best and gave an AUC significantly higher than Effect of normalization type half of the other methods (Delong’s test P < 0.044 for 5 of 10 methods). In general, whether the data were normalized globally or pair- In the SGSD data set, differences between the different nor- wise between the two groups compared did not have a major ef- malization methods and the log2 transformation were generally fect on the AUCs of the differential expression analysis (Figure small. Only five normalization methods, the Vsn, RlrMA, Rlr, 2A–C versus Supplementary Figure S5A–C). The only exceptions RlrMACyc and LoessF, ranked on average higher than the log2 were the cyclic normalization methods, LoessCyc and transformation in the SGSD data set (Table 1). In most of the RlrMACyc, which benefitted from normalizing the data pairwise two-group comparisons, there was no significant difference be- in the UPS1 data set (Figure 2A versus Supplementary Figure tween the AUC of the best ranking method and the AUCs of the S5A). This could also be seen in the MA plots of the UPS1 data, other methods (Delong’s test P > 0.05), with few exceptions. in which the data were centered well in the line M ¼ 0 in the In the 5 versus 7, 5 versus 8, 6 versus 7, 6 versus 8 and 7 versus 8 pairwise normalized data of the cyclic methods, but not in the comparisons, the Vsn normalization consistently ranked first globally normalized data of the same methods (Supplementary and gave a higher AUC than most of the other methods tested File S1). (Delong’s test P < 0.046 for 6–8 of 10 methods; Figure 2C). 
While no single method gave the highest AUC in every two- Effect of normalization on logFC group comparison, the Vsn normalization performed consist- ently well, giving high AUCs throughout all data sets. This re- When looking at the distribution of the logFC of the background sulted in the highest pooled mean rank across all data sets and proteins in all data sets, we can see that it is centered around Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 6| Valikangas et al. Figure 2. The effect of normalization method on differential expression results. The AUCs of the ROC curves of differential expression analysis in (A) UPS1 data, (B) CPTAC data and (C) SGSD data globally normalized with the different methods. The x axes denote the two-group comparisons of the sample groups. Table 1. Rankings of the normalization methods based on AUCs of the ROC curves of the differential expression analysis using global normal- ization. Best mean ranking in each data set and best pooled mean ranking with each test statistic are bolded. 
The methods were ranked inde- pendently when using different test statistics Normalization method Statistical test UPS1 CPTAC SGSD Pooled mean Log2 ROTS 8 6 1.02 8.5 6 0.76 4.6 6 0.78 5.9 6 1.49 t-test 7.1 6 1.35 8 6 1.29 4.5 6 0.74 5.6 6 1.87 Loess_fast ROTS 3.5 6 0.34 4.7 6 1.09 4.3 6 0.51 4.2 6 1.25 t-test 1.9 6 0.55 4.7 6 0.84 5.4 6 0.5 4.5 6 1.06 Loess_cyclic ROTS 6.9 6 1.17 2.8 6 0.75 5.1 6 0.64 5.2 6 1.53 t-test 7.9 6 0.89 4.3 6 0.95 7 6 0.58 6.8 6 1.43 Rlr_scatter ROTS 6.4 6 0.5 3.5 6 0.43 4.1 6 0.48 4.5 6 0.82 t-test 6.5 6 0.78 4.3 6 1.02 4.2 6 0.46 4.7 6 137 Rlr_ma ROTS 6.3 6 0.7 3.8 6 0.48 3.9 6 0.5 4.4 6 0.98 t-test 5.7 6 0.68 4.3 6 0.99 3.9 6 0.42 4.4 6 1.3 Rlr_ma_cyclic ROTS 7.3 6 0.54 6.5 6 1.57 4.3 6 0.58 5.3 6 1.75 t-test 6.3 6 0.67 4.2 6 1.45 4.6 6 0.5 4.9 6 1.68 Vsn ROTS 1 6 0 4.3 6 1.41 2.7 6 0.46 2.5 6 1.48 t-test 3.9 6 0.31 3.8 6 0.87 3.5 6 0.34 3.6 6 0.98 Quantile ROTS 8.2 6 0.61 7.7 6 0.56 7 6 0.74 7.4 6 1.11 t-test 8.8 6 0.39 8.3 6 0.67 9.6 6 0.46 9.2 6 0.99 Median ROTS 5.9 6 0.75 9.5 6 0.72 5 6 0.68 5.8 6 1.24 t-test 6.2 6 0.98 10 6 0.54 5.3 6 0.65 6.2 6 1.16 Progenesis ROTS 3.3 6 0.67 6.7 6 1.41 6.1 6 0.75 5.5 6 1.73 t-test 3 6 0.54 5.3 6 1.17 7.1 6 0.59 5.9 6 1.41 EigenMS ROTS 9.2 6 0.76 8 6 1.32 5.5 6 0.85 6.7 6 1.74 t-test 8.6 6 0.92 8.5 6 1.18 5.3 6 0.8 6.5 6 1.51 Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 Normalization in Quantitative label-free proteomics | 7 Figure 3. The logFC of the background proteins and representative examples of the logFC of the spike-in proteins. (A) The density distributions of the logFC of the back- ground proteins over all two-group comparisons in all data sets. The vertical dashed line corresponds to logFC of zero. The logFC of the spike-in proteins (upper boxes) and the background proteins (lower boxes) in the (B) 10 versus 25 fmol comparison of the UPS1 data and (C) the 0.74 versus 2.2 fmol comparison of the CPTAC data. 
The horizontal solid black lines correspond to logFC of zero, while the horizontal dashed lines correspond to the theoretical expected logFC of the spike-in proteins. zero for all the other normalization methods except for the well centered already after the logarithm transformation. In the EigenMS normalization (Figure 3A), for which the distribution UPS1 data, the data after the cyclic normalizations (RlrMACyc was identical to that of the log2 transformation. The distribu- and LoessCyc) were much more centered after pairwise normal- tion of the logFCs in the Vsn-normalized data was more concen- ization than after global normalization (Supplementary File S1). trated around zero than in data sets normalized with the other In many two-group comparisons, the quantile normalization methods, which can be seen as a narrower and higher density seemed to introduce extra patterns into the data on high inten- distribution for the Vsn-normalized data. sities not seen in the unnormalized log2-transformed data Based on the known concentrations of the spike-in proteins, (Supplementary File S1). the logFCs of the spike-in proteins were typically underesti- mated both in the normalized data as well as in the log2- Testing on mouse data transformed data (Figure 3B and C, Supplementary File S2). The In addition to the three spike-in data sets, we also compared EigenMS-normalized data gave similar estimates as the log2- the performance of the normalization methods in a mouse transformed data; the Vsn normalization gave generally more study data set, which represents a typical real study setting [18]. conservative estimates than the other normalization methods. When looking at the levels of total intensities of the samples in All the other normalization methods gave consistently similar the log2-transformed mouse data, we can see that they are un- estimates for the logFC of the spike-in proteins. In the UPS1 equal (Supplementary Figure S6A). 
When applying normaliza- data, the logFC of the spike-in proteins of the normalized data tion, most of the methods equalize the levels of total intensities was closer to the theoretical known logFC in general than in the of different samples, except for the EigenMS (Supplementary log2-transformed data (Supplementary File S2). Figure S6B–K). In the mouse data set, we investigated biological replicates of Visual quality inspection the same treatment group instead of technical replicates. Similar The MA plot is a common tool for exploring the bias in the data patterns for intragroup variation for data normalized with the dif- of two samples [5, 9]. Normalization aims to remove the bias ferent methods were observed as with the spike-in data sets from the data and center the data scatter of the sample pair (Figure 5). All normalization methods decreased intragroup vari- examined around the x axis (M ¼ 0) in the MA plot. In this study, ation when measured with the PMAD compared with the unnor- MA plots were drawn and observed with each normalization malized data. PMAD was smallest in the Vsn- and EigenMS- method in each two-group comparison of each data set. Based normalized data, but the differences to the other methods were on visual inspection of these plots, the Vsn normalization not significant (Wilcoxon signed rank test >0.33; Figure 5A). seems to concentrate the data more tightly both around the x Similar patterns were observed with the other intragroup meas- axis and to a narrower scale of transformed intensities than the ures PCV and PEV (Supplementary Figure S7). Intragroup similar- logarithm transformation and the other normalization methods ity measured with the Pearson correlation coefficient was highest in general (Figure 4, Supplementary File S1). In the CPTAC and among the EigenMS-normalized data, but the differences to the the SGSD data sets, the data in the two-group comparisons were other methods were small (Wilcoxon P > 0.18; Figure 5B). 
The Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 8| Valikangas et al. Figure 4. Representative MA plots of the two-group comparisons after normalization with the most successful normalization method and log2 transformation in each data set. MA plots of the (A) 2 versus 10 fmol comparison of the UPS1 data, (B) 0.25 versus 2.2 fmol comparison of the CPTAC data and (C) sample 1 versus sample 4 com- parison of the SGSD data normalized with the Vsn normalization. MA plots of the (D) 2 versus 10 fmol comparison of the UPS1 data, (E) 0.25 versus 2.2 fmol comparison of the CPTAC data and (F) sample 1 versus sample 4 comparison of the SGSD data after the log2 transformation. The lighter nonblack points in the plots correspond to the spike-in proteins and the black points to the background proteins. The curve corresponds to a loess smoothing function. Figure 5. Intragroup variation between biological replicates in the mouse data normalized with the different methods. (A) The PMADs and (B) the Pearson correlation coefficients. Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 Normalization in Quantitative label-free proteomics | 9 mouse data did not contain any spike-in proteins and thus we might be different in different data sets. Therefore, the used nor- did not have prior knowledge about expected protein changes. malization should not make too rigid assumptions about the na- Therefore, differential expression analysis was not directly ap- ture of the bias, unless we know or can estimate the bias and plicable to assess the performance of the normalization meth- purposefully want to use a method targeting specifically that ods. The same was true for the logFC. kind of bias. The Vsn, quantile and the EigenMS normalizations do not make strict assumptions about the nature of the bias and are general methods in that sense. 
The median and quantile normalizations were on par with Discussion most of the normalization methods in reducing intragroup vari- In the spike-in data sets examined in this study, the Vsn nor- ation, but they did not rank well in terms of differential expres- malization consistently reduced intragroup variation the most, sion analysis. It is notable, however, that even though not increased intragroup similarity the most and gave consistently having a high ranking, both methods performed consistently in high AUCs in the differential expression analysis, resulting in the differential expression analysis by not producing low AUCs the highest pooled mean ranking among the normalization in any of the two-group comparisons like the log2 transform- methods tested. The EigenMS normalization also consistently ation did in the UPS1 data set (Figure 2A). More worrying is the reduced intragroup variation more than the other methods tendency of the quantile normalization to introduce extra pat- examined, but it did not perform well in the differential expres- terns into the data on high intensities seen on many two-group sion analysis. Also, other normalization methods decreased comparisons (Supplementary File S1). The Progenesis normal- intragroup variation when compared with the unnormalized ization had the second highest ranking in the differential ex- log2-transformed data, but no major differences between them pression analysis in the UPS1 data, but ranked worse in the two were observed. In previous comparisons of normalization meth- other data sets examined (Table 1). The EigenMS behaved ods in proteomics/peptidomics focusing on intragroup variation differently from the other normalization methods examined in measures, the Vsn normalization has been ranked average [10] this study. While it was effective in reducing intragroup vari- or as among the most suitable methods [3]. 
Previous studies ation, it did not perform so well in the differential expression have suggested the linear regression normalization or its vari- analysis. Instead, it performed similarly as the simple log2 ants or local regression normalization to reduce intragroup vari- transformation. ation the most [3, 9, 10]. We observed the linear regression An arbitrary but commonly used cutoff value to deter- normalization variants and the local regression normalization mine differentially expressed genes and proteins is a logFC variants performing on par with the other normalization meth- of one [27–29], which corresponds to a 2-fold change in ex- ods in reducing intragroup variation, with no major differences. pression. As we noticed from the logFC plots of the data nor- However, even though we did not observe the linear and local malized with the different methods (Figure 3B and C, regression to reduce intragroup variation more than the other Supplementary File S2), the estimates for the known differ- normalization methods, we noticed that the local regression entially expressed proteins frequently remained under this method using a mean reference sample, LoessF, consistently limit even if the differentially expressed proteins were de- produced high AUCs in the differential expression analysis. The tected with great accuracy. This was especially true for the same was true for the linear regression methods using a median Vsn-normalized data, which gave conservative estimates for reference sample, Rlr and RlrMA. The local regression normal- the logFC of the spike-in proteins, but from which the spike- ization fared better in the UPS1 data set, while the linear regres- in proteins were detected with great accuracy. 
This warrants sion normalization performed better in the CPTAC and SGSD caution for the use of any such generic cutoff values for fil- data sets, perhaps indicating a different kind of bias in the data tering the differentially expressed proteins based on their sets. Typically, the variants using a reference sample performed logFC. better than their cyclic counterparts, with the exception of the Although Vsn performed generally well in our comparisons, cyclic loess normalization LoessCyc in the CPTAC data. the fact that it consistently underestimated the logFCs of the It became clear that the spike-in data sets in this analysis dif- spike-in proteins can be seen as a potential drawback of the fered from each other. The sample groups of the UPS1 data set method if the researcher would be interested particularly in had much larger variation in the total intensities than the other examining the logFCs of proteins. For this particular task, some two data sets, especially the SGSD data set, which had many of the other well-performing normalization methods (LoessF, sample groups with roughly similar levels of total intensities. Rlr, RlrMA) would be perhaps more suitable. Also, all of the nor- This could be because of a number of reasons, such as different malization methods studied here, excluding EigenMS, assume instrumentation or protocols/methods used, but is interesting that only a small portion of the proteins are differentially ex- from the point of normalization. The total intensities between pressed between samples and force the total intensity levels of the samples may vary from data to data also in the case of real the samples to be on the same level (Supplementary Figure S1). experimental study settings, and we would like to find a normal- This might be problematic if in fact a large number of proteins ization method that can perform as consistently as possible no are differentially expressed between samples. 
In such cases, matter the characteristics of the data. Notably, normalization methods like the EigenMS might be more suitable for normaliz- clearly improved the AUCs also in the CPTAC data set when ing the data. We encourage the researcher to reflect on what is compared with the unnormalized log2-transformed data (Table known beforehand about the task at hand and select the appro- 1), regardless of the fact that it had rather equal total intensity priate normalization method accordingly. levels before normalization. This emphasizes the importance of All of the normalizations in this study were performed on a consistent normalization method; even if we have a high- protein-level data. Normalization can be performed also at the quality data set with rather equal unnormalized sample levels, peptide level. The next step would be to perform a similar ex- we cannot necessarily deduce whether a simple logarithmic haustive comparison of the normalization methods on peptide transformation would suffice in delivering the truly differen- level and explore if the same methods fare well with peptide tially expressed proteins reliably. Also, the nature of the bias data. Also, the choice of peptides to be used for Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 10 | Valikangas et al. 5. Bolstad BM, Irizarry RA, Astrand M, et al. A comparison of nor- the normalization has been demonstrated to have an effect [11], and exploring this idea in conjunction with the normalizations malization methods for high density Oligonucleotide array used in this study would be an interesting further topic. data based on variance and bias. Bioinformatics 2003;19:185–93. Based on the comparisons made in this study, normalization 6. Choe SE, Boutros M, Michelson AM, et al. 
Preferred analysis decreased intragroup variation in general and resulted in better methods for Affymetrix GeneChips revealed by a wholly AUCs in the differential expression analysis than the simple defined control dataset. Genome Biol 2005;6:R16. 7. Yang YH, Dudoit S, Luu P, et al. Normalization for cDNA log2 transformation in case of most of the normalization meth- ods examined. The Vsn normalization performed consistently microarray data. In: ML Bittner, YD Chen, AN Dorsel, ER well in reducing intragroup variation and in the differential ex- Dougherty (eds). Microarrays: Optical Technologies and pression analysis in all tested data sets. The local regression Informatics, Vol. 10. San Jose: SPIE, Society for Optical Engineering, 2001, 141–52. and linear regression normalizations using a reference also reduced intragroup variation compared with the unnormalized 8. Schadt EE, Li C, Ellis B, et al. Feature extraction and data and consistently delivered good AUCs in the differential normalization algorithms for high-density Oligon expression analysis. ucleotide gene expression array data. J Cell Biochem 2001;125:120–5. 9. Callister SJ, Barry RC, Adkins JN, et al. Normalization Key Points approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J Proteome Res Data generated by the MS analysis are prone to biases, 2006;5:277–86. which can be accounted for with normalization result- 10. Kultima K, Nilsson A, Scholz B, et al. Development and evalu- ing in more reliable downstream analysis. ation of normalization methods for label-free relative quanti- In total, 11 normalization methods were systematic- fication of endogenous peptides. Mol Cell Proteomics ally evaluated in this study using three spike-in and a 2009;8:2285–95. mouse label-free proteomics data sets. 11. Webb-Robertson B-JM, Matzke MM, Jacobs JM, et al. 
A statis- Vsn reduced variation the most between the technical tical selection strategy for normalization procedures in LC- replicates in all studied data sets and consistently per- MS proteomics experiments through dataset dependent formed well in the differential expression analysis. ranking of normalization scaling factors. Proteomics The local regression normalization using an artificial 2011;11:4736–41. reference sample (LoessF) and linear regression nor- 12. Chawade A, Sandin M, Teleman J, et al. Data processing has malization using artificial reference samples (Rlr and major impact on the outcome of quantitative label-free LC- RlrMA) also performed systematically well in the dif- MS analysis. J Proteome Res 2015;14:676–87. ferential expression analysis. 13. Cox J, Hein MY, Luber CA, et al. Accurate proteome-wide The nature and extent of the bias in the data are not label-free quantification by delayed normalization and max- generally known beforehand; the application of a con- imal peptide ratio extraction, termed MaxLFQ. Mol Cell sistent normalization method is crucial for reliable Proteomics 2014;13:2513–26. results. 14. Zhang B, Kall L, Zubarev RA. DeMix-Q: quantification- centered data processing workflow. Mol Cell Proteomics 2016;15:1467–78. 15. Pursiheimo A, Vehmas AP, Afzal S, et al. Optimization of stat- Supplementary Data istical methods impact on quantitative proteomics data. J Proteome Res 2015;14:4118–26. Supplementary data are available online at http://bib.oxford 16. Tabb DDL, Vega-Montoto L, Rudnick PA, et al. Repeatability journals.org/. and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry. J Proteome Res Funding 2010;9:761–76. 17. Bruderer R, Bernhardt OM, Gandhi T, et al. 
Extending the This study was supported by the Sigrid Juselius Foundation, limits of quantitative proteome profiling with data- JDRF [grant number 2-2013-32] and the European Research independent acquisition and application to acetaminophen Council (ERC) Starting Grant no. 677943. treated 3D liver microtissues. Mol Cell Proteomics 2015;14:1400–10. References 18. Vehmas AP, Adam M, Laajala TD, et al. Liver lipid metabolism is altered by increased circulating estrogen to androgen ratio 1. Megger DA, Bracht T, Meyer HE, et al. Label-free quantification in male mouse. J Proteomics 2016;133:66–75. in clinical proteomics. Biochim Biophys Acta 2013;1834: 19. Huber W, von Heydebreck A, Su ¨ ltmann H, et al. Variance sta- 1581–90. bilization applied to microarray data calibration and to the 2. Meissner F, Mann M. Quantitative shotgun proteomics: con- quantification of differential expression. Bioinformatics siderations for a high-quality workflow in immunology. Nat 2002;18(Suppl 1):S96–104. Immunol 2014;15:112–7. 20. R Core Team. R: A Language and Environment for Statistical 3. Chawade A, Alexandersson E, Levander F. Normalyzer: a tool Computing. R Foundation for Statistical Computing, Vienna, for rapid evaluation of normalization methods for Omics Austria, 2015. URL https://www.R-project.org/. data sets. J Proteome Res 2014;13:3114–20. 21. Venables WN, Ripley BD. Modern Applied Statistics with S. 4. Karpievitch YV, Dabney AR, Smith RD. Normalization and Fourth Edition. Springer, New York, 2002. ISBN 0-387- missing value imputation for label-free LC-MS analysis. BMC 95457-0. Bioinformatics 2012;13:S5. Downloaded from https://academic.oup.com/bib/article/19/1/1/2562889 by DeepDyve user on 18 July 2022 Normalization in Quantitative label-free proteomics | 11 26. Elo L, File ´ n S, Lahesmaa R, et al. Reproducibility-optimized 22. Ritchie ME, Phipson B, Wu D, et al. 
limma powers differential expression analyses for RNA-sequencing and microarray test statistic for ranking genes in microarray studies. IEEE/ studies. Nucleic Acids Res 2015;43:e47. ACM Trans Comput Biol Bioinform 2008;5:423–31. 27. Quackenbush J. Microarray data normalization and trans- 23. Bolstad BM. preprocessCore: A collection of pre-processing formation. Nat Genet 2002;32:496–501. functions. R package version 1.32.0, https://github.com/ 28. DeRisi J, Penland L, Brown PO, et al. Use of a cDNA microarray bmbolstad/preprocessCore. to analyse gene expression patterns in human cancer. Nat 24. Karpievitch YV, Taverner T, Adkins JN, et al. Normalization Genet 1996;14:457–60. of peak intensities in bottom-up MS-based proteomics using singular value decomposition. Bioinformatics 2009;25: 29. Cho SH, Goodlett D, Franzblau S. ICAT-based comparative prote- omic analysis of non-replicating persistent Mycobacterium tu- 2573–80. berculosis. Tuberculosis 2006;86:445–60. 25. Karpievitch YV, Nikolic SB, Wilson R, et al. Metabolomics data normalization with EigenMS. PLoS One 2014;9:e116221.

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Jan 1, 2018
