R2DGC: threshold-free peak alignment and identification for 2D gas chromatography-mass spectrometry in R

Abstract

Summary: Comprehensive 2D gas chromatography-mass spectrometry is a powerful method for analyzing complex mixtures of volatile compounds, but it produces a large amount of raw data that requires processing to align signals of interest (peaks) across multiple samples and to match peak characteristics to reference standard libraries prior to downstream statistical analysis. Very few existing tools address this aspect of analysis, and those that do have shortfalls in usability or performance. We have developed an R package that implements retention time and mass spectra similarity threshold-free alignments, seamlessly integrates retention time standards for universally reproducible alignments, performs common ion filtering and provides compatibility with multiple peak quantification methods. We demonstrate that our package’s performance compares favorably to existing tools on a controlled mix of metabolite standards separated under variable chromatography conditions and on data generated from cell lines.

Availability and implementation: R2DGC can be downloaded at https://github.com/rramaker/R2DGC or installed via the Comprehensive R Archive Network (CRAN).

Contact: sjcooper@hudsonalpha.org

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Metabolomics is a rapidly growing field that seeks to comprehensively measure a set of small molecule metabolites in a sample (Wishart, 2016). 2D gas chromatography-mass spectrometry (2D-GCMS) was developed to improve chromatographic separation, thus allowing for improved measurement of complex mixtures from biological samples (Mondello et al., 2008). 2D-GCMS couples two gas chromatography (GC) columns with complementary chemistries to a time-of-flight mass spectrometer.
The resulting data contain three identifying components for each feature (metabolite): two retention times, one from each chromatographic separation, which reflect the affinity of a compound for each column, and a mass spectrum that is relatively unique to each compound. Each component has associated error. Retention times can vary across experiments or instruments due to slight differences in chromatography methods or slight variations in column length, chemistry and age. Furthermore, subjecting metabolites to electron ionization, the most common ionization method for 2D-GCMS, can lead to similar mass spectra among structurally related compounds. However, combining these three components provides a powerful method for deconvoluting complex mixtures. Here, we describe a novel software package to streamline the data processing tasks of peak alignment and metabolite identification. We also provide a retention-indexed reference library containing information on 298 peaks derived from over 125 metabolite standards and commonly observed background peaks for use with the package.

2 Materials and methods

R2DGC is designed to take individual sample files containing basic peak information and perform pre-processing steps, generate an alignment table with areas of peaks common to multiple samples and match aligned peaks to a reference library (Fig. 1).

Fig. 1. R2DGC alignment pipeline workflow

Input data: The input for peak alignment with R2DGC is a list of file paths to individual sample files that contain the retention times, area and mass spectrum of each peak, as commonly output by vendor software such as ChromaTOF (Supplementary Section 2.2). The user can also specify the peak quantification method used to ensure proper downstream processing.
For example, to ensure accurate quantification, the aligner checks that all aligned peak areas were computed with the same ion, which is often referred to as the quant mass. If quant mass incongruities exist, the user is notified and a preliminary peak area conversion is performed. If primary or secondary retention time standards were used, the user only needs to follow the outlined naming scheme to identify each standard peak in order to make use of retention time indexing.

Sample quality control and pre-processing: The first optional pre-processing function (FindProblemIons) allows users to identify and filter problematic ions from the mass spectra of each peak to improve downstream spectral matching accuracy and efficiency (Supplementary Section 2.3). This function, as well as those described below, takes advantage of parallel computing in R so that alignments scale efficiently with increasing sample sizes. The second optional pre-processing function (PrecompressFiles) searches each sample file for neighboring peaks that likely represent a single compound that was incorrectly split during primary peak identification and quantification (Supplementary Section 2.4). Manually identifying these peaks in each sample is an extremely time-intensive process. The third pre-processing function (MakeReference) allows users to generate a reference standard library to identify the metabolites from which aligned peaks are likely derived (Supplementary Section 2.5). We have provided a pre-formatted standard reference library, installed with the package, containing more than 125 metabolites, each with its primary retention time indexed against nine fatty acid methyl esters.

Multi-sample peak alignment: The final and only required function (ConsensusAlign) takes sample files and an optional metabolite standard library and aligns common peaks across samples. It then identifies the likely metabolites from which each aligned peak was derived.
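As a rough illustration of the kind of matching score the aligner relies on, the sketch below computes a mass-spectral dot product between two peaks and subtracts a penalty for retention time differences. It is written in Python for illustration only and is not the package's R code: the `Peak` class, the function names, the cosine normalization to a 0-100 scale, the linear form of the penalty and the default penalty weights are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Peak:
    """Hypothetical peak record: two retention times and a mass spectrum."""
    rt1: float       # first-dimension retention time
    rt2: float       # second-dimension retention time
    spectrum: dict   # maps m/z -> intensity


def cosine_similarity(spec_a, spec_b):
    """Normalized dot product of two spectra (0.0 when either is empty)."""
    dot = sum(i * spec_b[mz] for mz, i in spec_a.items() if mz in spec_b)
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def penalized_similarity(a, b, rt1_penalty=1.0, rt2_penalty=10.0):
    """Spectral similarity on a 0-100 scale minus a retention time penalty.

    The penalty weights are illustrative assumptions, not package defaults.
    """
    spectral = 100.0 * cosine_similarity(a.spectrum, b.spectrum)
    penalty = (rt1_penalty * abs(a.rt1 - b.rt1)
               + rt2_penalty * abs(a.rt2 - b.rt2))
    return spectral - penalty


# Toy peaks: same spectrum with slightly shifted retention times,
# and a peak at the same retention times with no ions in common.
seed_peak = Peak(600.0, 1.50, {73: 100.0, 147: 50.0, 218: 25.0})
same_compound = Peak(602.0, 1.60, {73: 100.0, 147: 50.0, 218: 25.0})
other_compound = Peak(600.0, 1.50, {57: 80.0, 71: 60.0})

print(penalized_similarity(seed_peak, same_compound))
print(penalized_similarity(seed_peak, other_compound))
```

In this toy setup, a peak with an identical spectrum but shifted retention times still scores close to the maximum, while a peak that elutes at the same position but shares no ions scores zero, which is why the two kinds of evidence are combined.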
The aligner computes pairwise similarity scores between each peak in each sample file and each peak in a seed file. Additional detail on the algorithms used by the package is provided in the Supplementary Material. Briefly, each similarity score is computed as the dot product of the two peaks' mass spectra, penalized by differences in peak retention times. An optimal similarity score threshold for peak alignment (maximizing the number of peaks with at least one match while minimizing the total number of potential matches) is computed prior to aligning peaks. Alternatively, the user can manually alter the stringency of the alignment. After primary alignment, an optional second alignment can be performed that searches, at a slightly relaxed similarity threshold, for ‘high likelihood’ missing peaks that are present in nearly all samples. To ensure the selected seed file does not bias the alignment, the user can elect to use multiple seed files and retain only peaks aligned by a majority of seed files. This function outputs an alignment table with peak areas for each sample (columns) by each aligned peak (rows); a peak information table with retention times, mass spectra and reference library matches for each peak; and, depending on the peak quantification method used, a list of peaks that were aligned but whose peak areas were computed using different masses (Supplementary Section 2.6).

3 Results

We have extensively tested the package on both known mixtures of metabolite standards and extracts derived from tissues and cell lines. We provide results from two such test cases to demonstrate the utility of the package (https://github.com/rramaker/R2DGC_Datasets). The first test case involves six total samples representing three different known mixtures of amino acids analyzed with two different chromatography methods (Fig. 2A, Supplementary Section 4.1).
The second is a more typical metabolomics test set that includes 24 samples from pancreatic cancer cell lines, for which 12 amino acids in each sample were manually aligned to benchmark aligner performance (Fig. 2B and C, Supplementary Section 4.2). Under these conditions, we find that our alignment pipeline outperforms two previously developed open source alignment pipelines (Castillo et al., 2011; Jeong et al., 2012), resulting in more true alignments, fewer incorrect or missing alignments and a greater number of total peaks aligned across a majority of samples (Fig. 2A–C). We attribute the improvement to the incorporation of intelligent thresholding and other data quality adjustments implemented in our package, such as retention indexing and common ion filtering.

Fig. 2. Performance, measured as the number of correct alignments divided by the number of incorrect or missing alignments, on (A) a mix of amino acid standards run under variable chromatography conditions and (B) manually identified amino acids from cell line extracts. (C) Total number of peaks aligned across 75% of samples from cell line extracts

4 Conclusions

R2DGC provides a comprehensive, efficient and reproducible pipeline for 2D-GCMS peak alignment and identification.
It is one of the few freely available and actively maintained 2D-GCMS aligners and improves upon existing software by (i) calculating optimal retention time or mass spectra similarity thresholds rather than relying on an arbitrary designation, (ii) facilitating the use of retention time standards to allow for universally reproducible alignments, (iii) filtering ions that hinder mass spectra comparisons, and (iv) providing compatibility with multiple peak quantification methods.

Funding

This work was supported by the National Institutes of Health-National Institute of General Medical Sciences Medical Scientist Training Program [5T32GM008361-21 to R.C.R.].

Conflict of Interest: none declared.

References

Castillo S. et al. (2011) Data analysis tool for comprehensive two-dimensional gas chromatography/time-of-flight mass spectrometry. Anal. Chem., 83, 3058–3067.
Jeong J. et al. (2012) Model-based peak alignment of metabolomic profiling from comprehensive two-dimensional gas chromatography mass spectrometry. BMC Bioinformatics, 13, 27.
Mondello L. et al. (2008) Comprehensive two-dimensional gas chromatography-mass spectrometry: a review. Mass Spectrom. Rev., 27, 101–124.
Wishart D.S. (2016) Emerging applications of metabolomics in drug discovery and precision medicine. Nat. Rev. Drug Discov., 15, 473–484.

© The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Publisher: Oxford University Press
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btx825


Journal: Bioinformatics, Oxford University Press

Published: Dec 21, 2017
