Approximately independent linkage disequilibrium blocks in human populations

Tomaz Berisa; Joseph K. Pickrell

doi:10.1093/bioinformatics/btv546

Summary: We present a method to identify approximately independent blocks of linkage disequilibrium in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.

Availability and implementation: code: http://bitbucket.org/nygcresearch/ldetect; data: http://bitbucket.org/nygcresearch/ldetect-data.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

The genome-wide association study (GWAS) is a commonly used study design for the identification of genetic variants that influence complex traits. In this type of study, millions of genetic variants are genotyped on thousands to millions of individuals, and each variant is tested to see whether an individual's genotype is predictive of their phenotypes. Because of linkage disequilibrium (LD) in the genome (Pritchard and Przeworski, 2001), a single genetic variant with a causal effect on the phenotype leads to multiple statistical (but non-causal) associations at nearby variants. One initial analysis goal in a GWAS is to count the number of independent association signals in the genome while accounting for LD.

The most commonly used approach to counting independent single-nucleotide polymorphisms (SNPs) that influence a trait is to count 'peaks' of association signals; this can be done manually when the number of peaks is small (e.g. Wellcome Trust Case Control Consortium, 2007) or in a semi-automated way when the number of peaks is larger (e.g. Jostins et al., 2012). There are also fully automated methods that use LD patterns estimated from large reference panels of individuals (Yang et al., 2012). In some contexts (e.g. when performing identical analysis on multiple GWAS with the goal of comparing phenotypes), it is useful to define approximately independent LD blocks a priori rather than letting them vary across analyses performed on different phenotypes (Loh et al., 2015; Pickrell, 2014).

To define approximately independent LD blocks, Loh et al. (2015) used non-overlapping segments of 1 megabase, and Pickrell (2014) used non-overlapping segments of 5000 SNPs. The breakpoints of these segments undoubtedly sometimes fall in regions of strong LD, thus potentially splitting a single association signal over two blocks (and leading to over-counting of the number of associated variants). A better approximation could be obtained by considering the empirical patterns of LD in a reference panel (e.g. Anderson and Novembre, 2003; Greenspan and Geiger, 2004; Mannila et al., 2003). In the remainder of this article, we present an efficient signal processing-based heuristic for choosing approximate segment boundaries.

2 Approach and results

To estimate LD between pairs of SNPs, we use the $r^2$ metric. If a genetic variant is in LD with another genetic variant that has a causal influence on disease, then $r^2$ (times the strength of association at the causal SNP) is proportional to the association statistic at the non-causal SNP (Pritchard and Przeworski, 2001). For our purposes, we define two sets of SNPs as 'approximately independent' if the pairwise $r^2$ between SNPs in different sets is close to zero.

Our approach is a heuristic for choosing segment boundaries, given a mean segment size (which is the required input). Let there be $n$ genetic variants on a chromosome. The method can be broken down into the following basic steps (see the Supplementary Material for details):

1. Calculate the $n \times n$ covariance matrix $C$ for all pairs of SNPs using the shrinkage estimator of $C$ from Wen and Stephens (2010).
2. Convert the covariance matrix to an $n \times n$ matrix $P$ of squared Pearson product-moment correlation coefficients.
3. Convert the matrix $P = (e_{i,j})$ to a $(2n-1)$-dimensional vector $V = (v_k)$ as follows:

   $$v_k = \sum_{i=1}^{k} t_{i,\,k-i+1}, \qquad t_{i,j} = \begin{cases} e_{i,j}, & \text{if } 1 \le i, j \le n \\ 0, & \text{otherwise} \end{cases} \qquad (k = 1, 2, \ldots, 2n-1)$$

   The effect of this step is representing each antidiagonal of $P$ by the sum of its elements (Fig. 1a and b). This step has similarities to Bulik-Sullivan et al. (2015), where the authors represent each column by the sum of its elements. The method presented in this article uses the antidiagonal to differentiate between neighboring blocks of similar size.
4. Apply low-pass filters of increasing widths to (i.e. 'smooth') $V$ until the requested number of minima is achieved.
5. Perform a local search in the proximity of each minimum from Step 4 to fine tune the segment boundaries.

In reality, matrix $P$ turns out to be sparse, approximately banded and approximately block-diagonal, with sporadically overlapping blocks (Slatkin, 2008; Wall and Pritchard, 2003; Wen and Stephens, 2010).

To provide intuition for Step 3, Figure 1a shows a simplified example of a correlation matrix $P$, where two SNPs $i$ and $j$ are either correlated (represented by 1 in element $e_{ij}$ of the matrix) or uncorrelated (represented by zero, not shown). Representing each antidiagonal of $P$ by the sum of its elements results in the vector shown in Figure 1b, and identifying segments representing blocks of LD reduces to identifying local (or more stringently, global) minima in this vector. In reality, the elements $e_{ij}$ of $P$ are continuous values from the interval $[0, 1]$ and result in an extremely noisy vector $V$ (example in blue in Fig. 1c). Therefore, to identify large-scale trends of LD and reduce high frequency components in the signal, we apply a signal processing technique dubbed low-pass filtering [utilizing a Hann window (Blackman and Tukey, 1958)] in Step 4. The result of applying a low-pass filter (with width = 100) is shown in red in Figure 1c.

Fig. 1. (a) and (b) Schematic of the conversion of matrix P to vector V. (c) Example data (blue) with Hann filter applied (red). (d) Example of Crohn's disease GWAS hits with partially filtered vector V and comparison of breakpoints

Applying wider and wider filters to vector $V$ in Step 4 allows us to focus on the large scale structure of LD blocks but also causes the approach to miss small scale variation around identified minima. To counteract this effect, Step 5 conducts a local search in the proximity of each local minimum identified in Step 4 to find the closest SNP $l$ with $\min \sum_{i<l} \sum_{j>l} e_{ij}$.

We applied this method to sequencing data from European, African and East Asian populations in the 1000 Genomes Phase 1 dataset. We set a mean block size of 10 000 SNPs and used the algorithm to define the block boundaries. As expected, these boundaries fall in regions with considerably higher recombination rates than the genome-wide average (Supplementary Fig. S4). In Figure 1d, we show an example from GWAS results for Crohn's disease (Jostins et al., 2012) where using uniformly distributed breakpoints would result in double-counting of an association signal, whereas the LD-aware breakpoints avoid stretches of SNPs in LD.

To test whether this approach is useful more generally, we ran fgwas (Pickrell, 2014) on GWAS of Crohn's disease (Jostins et al., 2012) and height (Wood et al., 2014), using both uniformly distributed breakpoints and LD-aware breakpoints. Using the LD-aware breakpoints successfully eliminated double-counting of SNPs in moderate-to-high LD and on opposite sides of uniform breakpoints (Supplementary Material Section S6).

Funding

This work was supported by the National Institutes of Health (NIH) R01 grant number MH106842 to Joseph K. Pickrell.

Conflict of Interest: none declared.

References

Anderson,E.C. and Novembre,J. (2003) Finding haplotype block boundaries by using the minimum-description-length principle. Am. J. Hum. Genet., 73, 336–354.
Blackman,R.B. and Tukey,J.W. (1958) The measurement of power spectra from the point of view of communications engineering - part i. Bell Syst. Tech. J., 37, 185–282.
Bulik-Sullivan,B.K. et al. (2015) LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet., 47, 291–295.
Greenspan,G. and Geiger,D. (2004) Model-based inference of haplotype block variation. J. Comput. Biol., 11, 493–504.
Jostins,L. et al. (2012) Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature, 491, 119–124.
Loh,P.-R. et al. (2015) Contrasting regional architectures of schizophrenia and other complex diseases using fast variance components analysis. bioRxiv, doi: 10.1101/016527.
Mannila,H. et al. (2003) Minimum description length block finder, a method to identify haplotype blocks and to compare the strength of block boundaries. Am. J. Hum. Genet., 73, 86–94.
Pickrell,J.K. (2014) Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet., 94, 559–573.
Pritchard,J.K. and Przeworski,M. (2001) Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet., 69, 1–14.
Slatkin,M. (2008) Linkage disequilibrium - understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet., 9, 477–485.
Wall,J.D. and Pritchard,J.K. (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet., 4, 587–597.
Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14 000 cases of seven common diseases and 3 000 shared controls. Nature, 447, 661–78.
Wen,X. and Stephens,M. (2010) Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat., 4, 1158.
Wood,A.R. et al. (2014) Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet., 46, 1173–1186.
Yang,J. et al. (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet., 44, 369–375.
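
Illustrative sketch of Steps 3 to 5

The antidiagonal sums, Hann smoothing and local search described in Section 2 can be illustrated on a toy matrix. The following is only a minimal sketch under stated assumptions, not the ldetect implementation: the function names, the normalised-convolution form of the filter, the mapping from a vector index to a SNP index (integer halving) and the `radius` parameter are all hypothetical choices made for illustration. Indices are 0-based, so entry k of the vector collects the elements with i + j = k.

```python
import numpy as np

def antidiagonal_sums(P):
    """Step 3: represent each antidiagonal of P by the sum of its elements."""
    n = P.shape[0]
    # Flipping left-to-right turns antidiagonals into diagonals; offset n-1
    # is the single top-left element, offset -(n-1) the bottom-right one.
    flipped = np.fliplr(P)
    return np.array([np.trace(flipped, offset=k) for k in range(n - 1, -n, -1)])

def hann_smooth(v, width):
    """Low-pass filter v with a normalised Hann window (assumes 3 <= width <= len(v))."""
    window = np.hanning(width)
    return np.convolve(v, window / window.sum(), mode="same")

def local_minima(x):
    """Interior indices that are strictly lower than both neighbours."""
    return [k for k in range(1, len(x) - 1) if x[k] < x[k - 1] and x[k] < x[k + 1]]

def smooth_until(v, n_minima, start_width=3, step=2):
    """Step 4: widen the filter until at most n_minima local minima remain."""
    minima, width = local_minima(v), start_width
    while len(minima) > n_minima and width <= len(v):
        minima = local_minima(hann_smooth(v, width))
        width += step
    return minima

def cross_block_sum(P, l):
    """Sum of e_ij over i < l and j > l, the quantity minimised in Step 5."""
    return float(P[:l, l + 1:].sum())

def refine_breakpoint(P, k, radius):
    """Step 5: search SNPs near vector index k for the smallest cross-block sum."""
    n = P.shape[0]
    centre = k // 2  # assumed mapping from antidiagonal index to SNP index
    candidates = range(max(0, centre - radius), min(n - 1, centre + radius) + 1)
    # Ties are broken in favour of the SNP closest to the smoothed minimum.
    return min(candidates, key=lambda l: (cross_block_sum(P, l), abs(l - centre)))

# Toy example: two perfectly correlated blocks of four SNPs each.
P = np.zeros((8, 8))
P[:4, :4] = 1.0
P[4:, 4:] = 1.0

V = antidiagonal_sums(P)
minima = smooth_until(V, n_minima=1)
breakpoints = [refine_breakpoint(P, k, radius=2) for k in minima]
```

On this toy matrix the vector dips to zero on the one antidiagonal that crosses no within-block entries, which is the boundary between the two blocks. On real data the matrix is only approximately block-diagonal, so the smoothing step does most of the work and the refinement step corrects for the resolution it loses.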

Publisher: Oxford University Press
Copyright: © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected]
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btv546
pmid: 26395773


Journal

Bioinformatics – Oxford University Press

Published: Sep 22, 2015
