# The Use of the Expectation–Maximization (EM) Algorithm for Maximum Likelihood Estimation of Gametic Frequencies of Multilocus Polymorphic Codominant Systems Based on Sampled Population Data

Russian Journal of Genetics, Volume 38 (3) – Oct 13, 2004
11 pages

Publisher: Springer Journals

Subject: Biomedicine; Human Genetics

ISSN: 1022-7954

eISSN: 1608-3369

DOI: 10.1023/A:1014867121080

### Abstract

Estimating gametic frequencies in multilocus polymorphic systems from the numerical distribution of multilocus genotypes in a population sample (“analysis without pedigrees”) is difficult because some gametes cannot be recognized in the data. Even in codominant systems, where every allele can be recognized from the genotype and allele frequencies can therefore be estimated directly (“complete data”), the frequencies of multilocus gametes sometimes cannot be estimated from multilocus genotype data (“incomplete data”), whether population data or even family data collected for studying genotypic segregation or linkage are used. Such “incomplete data” are analyzed with the expectation–maximization (EM) algorithm under an appropriate genetic model. In this study, gametic frequencies were estimated with an EM algorithm based on the random-mating model for a nonsubdivided population. The algorithm places no limits on the number of loci or on the number of alleles per locus. Loci and alleles are identified by number, which makes it possible to process them in loops. For each combination of alleles at a given set of m out of L loci (L is the total number of loci studied), the alleles in that combination are assigned the value 1 and all remaining alleles the value 0. The sum of zeros and ones for a gamete is its gametic value (h), and the sum of the gametic values of the two gametes that form a genotype is the genotypic value (g) of that genotype. Gametes with the same h are then pooled into a single class, which reduces the number of parameters to be estimated. In the general case of m loci, this procedure yields m + 1 classes of gametes and 2m + 1 classes of genotypes with genotypic values g = 0, 1, 2, ..., 2m.
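
The 0/1 coding and the regrouping of genotypes into classes can be illustrated with a minimal sketch. The data layout (a gamete as a tuple of allele labels, an unphased genotype as a tuple of allele pairs, one per locus) and the function names are assumptions made for this illustration and do not come from the article.

```python
from collections import Counter


def gametic_value(gamete, target):
    """h: number of loci at which the gamete carries the target allele (coded 1)."""
    return sum(1 for allele, t in zip(gamete, target) if allele == t)


def genotypic_value(genotype, target):
    """g: sum of the gametic values of the two gametes forming the genotype.

    The genotype is unphased: one (allele, allele) pair per locus. Because g
    only counts copies of the target alleles, it does not depend on phase.
    """
    return sum((a == t) + (b == t) for (a, b), t in zip(genotype, target))


def regroup(genotypes, target):
    """Counts n_g of individuals for g = 0, 1, ..., 2m (m = number of loci)."""
    m = len(target)
    counts = Counter(genotypic_value(g, target) for g in genotypes)
    return [counts.get(g, 0) for g in range(2 * m + 1)]


# Hypothetical sample of three individuals typed at two loci (labels are illustrative).
sample = [
    (("M", "N"), ("S", "s")),
    (("M", "M"), ("S", "S")),
    (("N", "N"), ("s", "s")),
]
print(regroup(sample, target=("M", "S")))  # [1, 0, 1, 0, 1]
```

With m = 2 loci the list has 2m + 1 = 5 entries, one per genotypic value, which matches the class counts stated above.
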
The unknown frequencies of the m + 1 classes of gametes can be expressed as functions of the gametic frequencies whose maximum likelihood estimates (MLEs) were obtained in all previous EM procedures and of the single unknown frequency, $P_m(m)$, that is to be estimated in the given EM procedure. At the expectation step, the expected frequencies $F_m(g)$ of the genotypes with genotypic value g are expressed in terms of products of the frequencies of the m + 1 classes of gametes. The genotype data are the numbers $n_g$ of individuals with genotypic values g = 0, 1, 2, 3, ..., 2m. The maximization step maximizes the logarithm of the likelihood function (LLF) for the $n_g$ values. Thus, in each case the EM algorithm reduces to solving a single equation in one unknown parameter using the $n_g$ values, i.e., the numbers of individuals obtained after the corresponding regrouping of the individuals' genotype data. Applying Weir's EM algorithm and the EM algorithm proposed in this study to the data obtained by Kurbatova on the MNSs and Rhesus systems (the latter with alleles $C$, $C^w$, $c$, $D$, $d$, $E$, $e$) yielded similar results. However, the MLEs obtained with either algorithm often converged to a wrong solution: the sum of the frequencies of all gametes (4 and 12 gametes for MNSs and Rhesus, respectively) was not equal to 1.0, even when the global maximum of the LLF was reached for each of them (as it was for MNSs with Weir's EM algorithm) and each parameter fell within its admissible limits (e.g., between 0 and $\min(P_N, P_s)$ for $P_{Ns}$). The $\chi^2$ function is proposed as a goodness-of-fit function for the distribution of genotypes in a sample, to be used for selecting acceptable solutions.
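
The reduction to one unknown parameter can be sketched for the simplest case, m = 2. The sketch assumes two target alleles with already estimated frequencies `p_a` and `p_b`, so that the three gametic classes depend only on the unknown frequency `x` of the gamete carrying both alleles. It also swaps in a direct one-dimensional search (grid scan plus golden-section refinement) for the article's own procedure, which is not reproduced in the abstract. All names and numbers are illustrative assumptions.

```python
import math


def class_frequencies(x, p_a, p_b):
    """Frequencies of gametic classes h = 0, 1, 2 as functions of the unknown x."""
    return [1.0 - p_a - p_b + x, p_a + p_b - 2.0 * x, x]


def expected_genotype_classes(p):
    """F(g) under random mating: sum of P(h1) * P(h2) over all h1 + h2 = g."""
    f = [0.0] * (2 * (len(p) - 1) + 1)
    for h1, p1 in enumerate(p):
        for h2, p2 in enumerate(p):
            f[h1 + h2] += p1 * p2
    return f


def log_likelihood(x, p_a, p_b, n):
    """Multinomial log-likelihood (up to a constant) of the counts n_g."""
    f = expected_genotype_classes(class_frequencies(x, p_a, p_b))
    total = 0.0
    for n_g, f_g in zip(n, f):
        if n_g == 0:
            continue
        if f_g <= 0.0:
            return -math.inf  # observed class has zero expected frequency
        total += n_g * math.log(f_g)
    return total


def estimate_x(p_a, p_b, n, grid=2000, tol=1e-9):
    """Maximize the log-likelihood over the admissible interval for x."""
    lo, hi = max(0.0, p_a + p_b - 1.0), min(p_a, p_b)
    step = (hi - lo) / grid
    # Coarse scan first, because unimodality over the whole interval is not assumed.
    best = max(range(grid + 1),
               key=lambda i: log_likelihood(lo + i * step, p_a, p_b, n))
    a = max(lo, lo + (best - 1) * step)
    b = min(hi, lo + (best + 1) * step)
    # Golden-section refinement inside the bracket around the best grid point.
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if log_likelihood(c, p_a, p_b, n) > log_likelihood(d, p_a, p_b, n):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2.0


# Synthetic counts built from p_a = 0.5, p_b = 0.4, x = 0.3 (not data from the article).
n_g = [1600, 2400, 3300, 1800, 900]
x_hat = estimate_x(0.5, 0.4, n_g)
```

Because the synthetic counts are exactly proportional to the expected frequencies at x = 0.3, the search should return a value close to 0.3. The search interval is the admissible range [max(0, p_a + p_b − 1), min(p_a, p_b)], which keeps all three class frequencies nonnegative.
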
However, the minimum of this function guarantees that a solution is acceptable only if all constraints on the parameters are met: the estimated gametic frequencies sum to 1.0, each frequency falls within its admissible limits, and the “gametic algebra” is satisfied (none of the frequencies is negative).
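
The acceptance criteria listed above can be sketched as a pair of checks: a Pearson $\chi^2$ statistic comparing observed counts with expected frequencies, and a test that a set of estimated gametic frequencies is nonnegative, within its limits, and sums to 1.0. The function names, the tolerance, and the example values are assumptions for illustration only.

```python
def chi_square(n, f):
    """Pearson chi-square between observed counts n_g and expected frequencies F(g)."""
    total = sum(n)
    stat = 0.0
    for n_g, f_g in zip(n, f):
        expected = total * f_g
        if expected <= 0.0:
            if n_g > 0:
                return float("inf")  # observed individuals in an impossible class
            continue
        stat += (n_g - expected) ** 2 / expected
    return stat


def is_acceptable(freqs, bounds=None, tol=1e-6):
    """True if the frequencies are nonnegative, within bounds, and sum to 1.0."""
    if any(p < -tol for p in freqs):
        return False
    if abs(sum(freqs) - 1.0) > tol:
        return False
    if bounds is not None:
        for p, (lo, hi) in zip(freqs, bounds):
            if p < lo - tol or p > hi + tol:
                return False
    return True


# Illustrative values only: a set summing to 1.0 and a set summing to 1.1.
print(is_acceptable([0.4, 0.3, 0.3]))  # True
print(is_acceptable([0.5, 0.3, 0.3]))  # False
```

A low $\chi^2$ value would be used to rank candidate solutions only after `is_acceptable` has passed, which mirrors the order of the conditions stated above.
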

### Journal

Russian Journal of Genetics, Springer Journals

Published: Oct 13, 2004
