Biometrics 74, 155–164 DOI: 10.1111/biom.12708
Computation of Ancestry Scores with Mixed Families and Unrelated Individuals
Yi-Hui Zhou,
James S. Marron,
and Fred A. Wright
Department of Biological Sciences, Bioinformatics Research Center, North Carolina State University, Raleigh,
North Carolina, U.S.A.
Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, U.S.A.
Department of Biological Sciences and Statistics, Bioinformatics Research Center, North Carolina State
University, Raleigh, U.S.A.
email: yihui firstname.lastname@example.org
Summary. The issue of robustness to family relationships in computing genotype ancestry scores such as eigenvector
projections has received increased attention in genetic association, and is particularly challenging when sets of both unrelated
individuals and closely related family members are included. The current standard is to compute loadings (left singular
vectors) using unrelated individuals and to compute projected scores for remaining family members. However, projected ancestry
scores from this approach suffer from shrinkage toward zero. We consider two main novel strategies: (i) matrix substitution
based on decomposition of a target family-orthogonalized covariance matrix, and (ii) using family-averaged data to obtain
loadings. We illustrate the performance via simulations, including resampling from 1000 Genomes Project data, and analysis
of a cystic fibrosis dataset. The matrix substitution approach has similar performance to the current standard, but is simple
and uses only a genotype covariance matrix, while the family-average method shows superior performance. Our approaches
are accompanied by novel ancillary approaches that provide considerable insight, including individual-specific eigenvalue scree
plots.
Key words: Genetic association; Population stratification; Principal components.
Differing ancestries of human subpopulations create systematic
differences in genetic allele frequencies across the genome,
a phenomenon known as population stratification or substructure.
If a phenotypic trait such as disease is associated with
subpopulation membership, a genetic association study can
identify spurious relationships with genetic markers. Singular
value decomposition (SVD) of genotype data or eigendecomposition
of covariance matrices can be used to identify population
stratification. The eigenvectors (essentially principal
component scores) that correspond to large eigenvalues can be
used as covariates in association analysis (Levine et al., 2013).
The combined analysis of unrelated and related individuals is
a common feature of genetic association studies (Zhu et al.,
2008). However, the presence of close-degree relatives in a
genetic dataset presents difficulties, as the family structure
can greatly influence the eigenvalues and eigenvectors.
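The equivalence between the SVD of genotype data and the eigendecomposition of a covariance-type matrix can be illustrated with a small simulation. The following is a minimal sketch only: the two-subpopulation allele-frequency model, the sample sizes, and all variable names are assumptions chosen for illustration, and the genotype matrix is taken to be markers by individuals, so that right singular vectors play the role of ancestry scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two subpopulations of 40 individuals, 500 markers.
n_per_pop, n_markers = 40, 500
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_markers), 0.05, 0.95)
labels = np.repeat([0, 1], n_per_pop)

# Genotypes are minor-allele counts (0, 1, 2); rows = markers, columns = individuals.
freqs = np.where(labels[None, :] == 0, freq_a[:, None], freq_b[:, None])
G = rng.binomial(2, freqs).astype(float)

# Center each marker (row) before decomposition.
X = G - G.mean(axis=1, keepdims=True)

# Route 1: SVD of the centered genotype matrix.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
scores_svd = Vt[0]  # first right singular vector (one entry per individual)

# Route 2: eigendecomposition of the individual-by-individual cross-product.
C = X.T @ X
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
scores_eig = eigvecs[:, -1]           # eigenvector of the largest eigenvalue

# The two routes agree up to an arbitrary sign flip.
agree = min(np.abs(scores_svd - scores_eig).max(),
            np.abs(scores_svd + scores_eig).max())

# The sign of the leading score should track subpopulation membership.
predicted = (scores_svd > 0).astype(int)
accuracy = max(np.mean(predicted == labels), np.mean(predicted != labels))
print(f"max discrepancy: {agree:.2e}, subpopulation accuracy: {accuracy:.2f}")
```

In this sketch the squared singular values equal the eigenvalues of the cross-product matrix, which is why either decomposition can be used to obtain the scores.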
Cystic fibrosis (CF) is a recessive genetic lung disorder,
caused by a mutation in the single gene CFTR. However,
considerable genetic variation remains in the severity of disease,
and evidence indicates this variation is complex and influenced
by numerous genes (Wright et al., 2011). Genotypes
gathered by the North American CF Consortium are typical
of a large-scale genome-wide association study (GWAS), with
thousands of individuals and over 1 million genetic markers
(Corvol et al., 2015). For covariate control, the eigenvectors
are computed for a submatrix of the genotypes, after
a “thinning” process that retains only an ancestry-informative
subset of markers with low marker–marker correlation
(Patterson et al., 2006). We illustrate the
proposed methods using the dataset from the CF patients
described as “GWAS1” in Corvol et al. (2015), with 21,205
thinned ancestry markers and 3444 individuals. The dataset
includes 2546 singletons (unrelated to others) and 438 small
families of siblings (417 sets of 2 individuals, 20 sets of 3,
and 1 set of 4). Figure 1 is a scatter plot of the fifth versus
the first “ancestry scores” (right singular vectors for
this example) from a naive analysis of all 3444 individuals
(see Section 2).
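The correlation-based part of the thinning step described above can be sketched as a greedy filter. This is an illustrative assumption, not the procedure used for the CF data: the function name `thin_markers`, the threshold `max_abs_corr`, and the greedy ordering are hypothetical, and the sketch does not address how ancestry-informative markers are selected.

```python
import numpy as np

def thin_markers(G, max_abs_corr=0.2):
    """Greedy thinning sketch (hypothetical helper).

    G is a markers-by-individuals genotype matrix. A marker is kept only if
    its absolute Pearson correlation with every previously kept marker is at
    most max_abs_corr. Monomorphic markers are dropped.
    """
    X = G - G.mean(axis=1, keepdims=True)  # center each marker
    norms = np.linalg.norm(X, axis=1)
    kept = []
    for j in range(X.shape[0]):
        if norms[j] == 0:
            continue  # no variation, so the marker carries no information
        correlated = any(
            abs(X[j] @ X[k]) / (norms[j] * norms[k]) > max_abs_corr
            for k in kept
        )
        if not correlated:
            kept.append(j)
    return kept

# Toy example: marker 1 duplicates marker 0, marker 2 is monomorphic,
# marker 3 is perfectly anti-correlated with marker 0, and marker 4 is
# uncorrelated with marker 0.
toy = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [1, 1, 1, 1, 1, 1],
    [2, 1, 0, 2, 1, 0],
    [1, 0, 1, 1, 2, 1],
], dtype=float)
kept = thin_markers(toy)
print(kept)  # [0, 4]
```

The pairwise loop is quadratic in the number of kept markers, so a practical pipeline would typically restrict comparisons to nearby markers; that refinement is omitted here for brevity.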
In Figure 1, the PC5 scores are driven largely by membership
in the family of size 4, rather than the ancestry substructure
of interest. Several additional top-ranked eigenvectors
are also driven by family membership. Accordingly, matrix
projection methods have been proposed (Zhu et al., 2008),
in which singular value decomposition is performed on
singletons, followed by projections for the remaining families.
However, this approach has been shown to produce shrunken
projected scores for the family members (Lee et al., 2010). In
Conomos et al. (2015), the PCAiR method was proposed to
expand the set of individuals included in the SVD to include a
single individual from each family, resulting in improved
performance. However, the question remains as to whether scores
for the remaining projected individuals will exhibit shrinkage,
or if the methods can be further improved.
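The shrinkage of projected scores can be reproduced in a small simulation. The sketch below is an illustration under stated assumptions rather than the analysis performed in the paper: the held-out individuals are independent draws from the same two hypothetical subpopulations (standing in for the family members that would be projected), and the allele-frequency model, sample sizes, and variable names are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting with many more markers than individuals.
n_markers, n_per_pop = 2000, 30
freq_a = rng.uniform(0.1, 0.9, n_markers)
freq_b = np.clip(freq_a + rng.normal(0, 0.1, n_markers), 0.05, 0.95)
labels = np.repeat([0, 1], n_per_pop)

def draw_genotypes():
    """Draw a markers-by-individuals genotype matrix for both subpopulations."""
    freqs = np.where(labels[None, :] == 0, freq_a[:, None], freq_b[:, None])
    return rng.binomial(2, freqs).astype(float)

G_ref = draw_genotypes()  # reference set used to compute loadings
G_new = draw_genotypes()  # held-out set that is only projected

# Center both sets with the reference marker means.
marker_means = G_ref.mean(axis=1, keepdims=True)
X_ref = G_ref - marker_means
X_new = G_new - marker_means

# Loadings are the left singular vectors of the reference data.
U, d, Vt = np.linalg.svd(X_ref, full_matrices=False)
u1 = U[:, 0]

ref_scores = u1 @ X_ref   # equals d[0] * Vt[0]
proj_scores = u1 @ X_new  # projected scores for the held-out individuals

def separation(scores):
    """Difference in mean score between the two subpopulations."""
    return scores[labels == 1].mean() - scores[labels == 0].mean()

# A ratio below 1 indicates that projected scores are pulled toward zero.
shrinkage_ratio = separation(proj_scores) / separation(ref_scores)
print(f"projected / reference separation: {shrinkage_ratio:.2f}")
```

Because the loadings partly fit noise in the reference set, the reference scores are inflated relative to scores obtained by projection, and the effect grows as the number of markers increases relative to the number of reference individuals.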
© 2017 The Authors. Biometrics published by Wiley Periodicals, Inc. on behalf of International Biometric Society
This is an open access article under the terms of the Creative Commons Attribution License, which permits use,
distribution and reproduction in any medium, provided the original work is properly cited.