# The Use of the Expectation–Maximization (EM) Algorithm for Maximum Likelihood Estimation of Gametic Frequencies of Multilocus Polymorphic Codominant Systems Based on Sampled Population Data

Russian Journal of Genetics, Volume 38 (3) – Oct 13, 2004, 11 pages

• Publisher: Springer Journals
• Subject: Biomedicine; Human Genetics
• ISSN: 1022-7954
• eISSN: 1608-3369
• DOI: 10.1023/A:1014867121080

### Abstract

Estimating gametic frequencies in multilocus polymorphic systems from the numerical distribution of multilocus genotypes in a population sample (“analysis without pedigrees”) is difficult because some gametes cannot be recognized in the data. Even in codominant systems, where every allele can be recognized from the genotype and allele frequencies can therefore be estimated directly (“complete data”), the frequencies of multilocus gametes sometimes cannot be estimated from multilocus genotype data (“incomplete data”), whether population data or even family data are used for studying genotypic segregation or linkage. Such incomplete data are analyzed under the corresponding genetic models using the expectation–maximization (EM) algorithm. In this study, an EM algorithm based on the random-mating model for a nonsubdivided population was used to estimate gametic frequencies. The algorithm places no limit on the number of loci or on the number of alleles per locus. Loci and alleles are identified by number, which makes it possible to process them in loops.

For a given combination of $m$ out of $L$ loci ($L$ is the total number of loci studied) and a given combination of alleles at these loci, the alleles in the combination are assigned the value 1 and all remaining alleles the value 0. The sum of these zeros and ones over a gamete is its gametic value ($h$), and the sum of the gametic values of the two gametes that form a genotype is the genotypic value ($g$) of that genotype. Gametes with the same $h$ are then pooled into a single class, which reduces the number of parameters to be estimated. In the general case of $m$ loci, this procedure yields $m + 1$ classes of gametes and $2m + 1$ classes of genotypes with genotypic values $g = 0, 1, 2, \ldots, 2m$.

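
The value assignment described above can be illustrated with a short Python sketch. It is not the paper's code: the two-locus MN/Ss setup, the choice of N and s as the alleles assigned value 1, the four sample genotypes, and the names `gametic_value` and `genotypic_value` are all assumptions made for the example. The sketch also shows that the genotypic value can be counted directly from an unphased genotype, because it equals the number of copies of the marked alleles the individual carries.

```python
from collections import Counter
from itertools import product

# Hypothetical two-locus example (m = 2); allele names are illustrative.
ALLELES = [("M", "N"), ("S", "s")]   # alleles at locus 0 and locus 1
MARKED = ("N", "s")                  # alleles assigned value 1; all others get 0


def gametic_value(gamete, marked=MARKED):
    """h: number of marked alleles carried by a gamete (one allele per locus)."""
    return sum(allele == mark for allele, mark in zip(gamete, marked))


def genotypic_value(genotype, marked=MARKED):
    """g: number of marked allele copies in a genotype (one allele pair per locus).

    No phase information is needed, since g = h1 + h2 is just a count of alleles.
    """
    return sum(pair.count(mark) for pair, mark in zip(genotype, marked))


# Pool all possible gametes into classes by gametic value h.
gamete_classes = Counter(gametic_value(g) for g in product(*ALLELES))

# Made-up sample of four unphased two-locus genotypes.
sample = [
    (("M", "N"), ("S", "s")),
    (("N", "N"), ("s", "s")),
    (("M", "M"), ("S", "S")),
    (("M", "N"), ("s", "s")),
]

# Regroup the sample into counts n_g for g = 0, 1, ..., 2m.
n_g = Counter(genotypic_value(gt) for gt in sample)
counts = [n_g.get(g, 0) for g in range(2 * len(MARKED) + 1)]

print(dict(gamete_classes))
print(counts)
```

With two loci, `gamete_classes` has $m + 1 = 3$ classes and `counts` has $2m + 1 = 5$ entries, matching the class counts stated in the text.
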
The unknown frequencies of the $m + 1$ classes of gametes can be expressed as functions of the gametic frequencies whose maximum likelihood estimates (MLEs) were obtained in all previous EM procedures and of the single unknown frequency, $P_m(m)$, that is estimated in the given EM procedure. At the expectation step, the expected frequencies $F_m(g)$ of the genotypes with genotypic values $g$ are expressed in terms of products of the frequencies of the $m + 1$ classes of gametes. The genotype data are the numbers $n_g$ of individuals with genotypic values $g = 0, 1, 2, 3, \ldots, 2m$. The maximization step maximizes the logarithm of the likelihood function (LLF) for the $n_g$ values. Thus, in each case the EM algorithm reduces to solving a single equation in one unknown parameter using the $n_g$ values, i.e., the numbers of individuals obtained after the corresponding regrouping of the individuals' genotype data.

Treatment of the data obtained by Kurbatova on the MNSs and Rhesus systems (the latter with alleles $C$, $C^w$, $c$, $D$, $d$, $E$, $e$) with Weir's EM algorithm and with the EM algorithm suggested in this study yielded similar results. However, the MLEs obtained with either algorithm often converged to a wrong solution: the sum of the frequencies of all gametes (4 and 12 gametes for MNSs and Rhesus, respectively) was not equal to 1.0, even when the global maximum of the LLF was reached for each of them (as it was for MNSs with Weir's EM algorithm) and each parameter fell within its admissible limits (e.g., $[0, \min(P_N, P_s)]$ for $P_{Ns}$). The $\chi^2$ function is suggested as a goodness-of-fit function for the distribution of genotypes in a sample in order to select acceptable solutions. However, the minimum of this function guarantees an acceptable solution only if all constraints on the parameters are met: the estimated gametic frequencies sum to 1.0, each frequency falls within its admissible limits, and the “gametic algebra” is satisfied (none of the frequencies is negative).
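
For the simplest case of two loci with two alleles each, the single-parameter maximization and the acceptability checks might be sketched as follows. This is a sketch under stated assumptions, not the authors' or Weir's implementation: the allele frequencies `p_a` and `p_b` are taken as already estimated, the unknown `x` stands for $P_m(m)$ with $m = 2$, a plain grid search stands in for solving the likelihood equation, and all function names and numbers are hypothetical. The lower bound $\max(0, p_a + p_b - 1)$ is added here so that every class frequency stays non-negative.

```python
import math


def class_freqs(x, p_a, p_b):
    """Frequencies of gamete classes h = 0, 1, 2, where x is the frequency of
    the gamete carrying both marked alleles and p_a, p_b are the (assumed
    known) frequencies of the two marked alleles."""
    return [1.0 - p_a - p_b + x, p_a + p_b - 2.0 * x, x]


def genotype_freqs(x, p_a, p_b):
    """Expected frequencies F(g), g = 0..4, under random mating: each genotype
    class sums the products of gamete classes with h1 + h2 = g."""
    p = class_freqs(x, p_a, p_b)
    f = [0.0] * 5
    for h1, p1 in enumerate(p):
        for h2, p2 in enumerate(p):
            f[h1 + h2] += p1 * p2
    return f


def log_likelihood(x, counts, p_a, p_b):
    """Multinomial log-likelihood of the counts n_g, up to a constant."""
    total = 0.0
    for n, f in zip(counts, genotype_freqs(x, p_a, p_b)):
        if n > 0:
            if f <= 0.0:
                return -math.inf  # observed class has zero expected frequency
            total += n * math.log(f)
    return total


def estimate_x(counts, p_a, p_b, steps=10_000):
    """Grid search for x over its admissible interval (an assumed stand-in
    for solving the single likelihood equation)."""
    lo, hi = max(0.0, p_a + p_b - 1.0), min(p_a, p_b)
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda x: log_likelihood(x, counts, p_a, p_b))


def chi_square(x, counts, p_a, p_b):
    """Chi-square goodness of fit between observed and expected n_g."""
    n = sum(counts)
    return sum((obs - n * f) ** 2 / (n * f)
               for obs, f in zip(counts, genotype_freqs(x, p_a, p_b)) if f > 0.0)


def is_acceptable(gamete_freqs, allele_freqs, tol=1e-6):
    """Check the constraints named in the text: frequencies sum to 1, none is
    negative, and none exceeds the frequency of any allele it carries."""
    if abs(sum(gamete_freqs.values()) - 1.0) > tol:
        return False
    return all(-tol <= f <= min(allele_freqs[a] for a in gamete) + tol
               for gamete, f in gamete_freqs.items())


# Made-up numbers: counts are exact expectations for a sample of 1000.
p_N, p_s, true_x = 0.45, 0.60, 0.30
counts = [1000 * f for f in genotype_freqs(true_x, p_N, p_s)]
x_hat = estimate_x(counts, p_N, p_s)
gametes = {("N", "s"): x_hat, ("M", "s"): p_s - x_hat,
           ("N", "S"): p_N - x_hat, ("M", "S"): 1.0 - p_N - p_s + x_hat}
alleles = {"M": 1.0 - p_N, "N": p_N, "S": 1.0 - p_s, "s": p_s}
print(x_hat, chi_square(x_hat, counts, p_N, p_s), is_acceptable(gametes, alleles))
```

In this two-allele sketch the four gametic frequencies are derived from one parameter, so they sum to 1 by construction. The check in `is_acceptable` becomes informative when, as in the procedure described above, gametic frequencies are estimated in separate procedures and then combined.
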

### Journal

Russian Journal of Genetics, Springer Journals

Published: Oct 13, 2004
