Quality & Quantity 34: 223–235, 2000.
© 2000 Kluwer Academic Publishers. Printed in the Netherlands.
A Probabilistic Clustering Model
for Variables of Mixed Type
Institute of Sociology, University of Erlangen-Nuremberg, D-90402 Nuremberg, Germany
Abstract. This paper develops a probabilistic clustering model for mixed data. The model allows
analysis of variables of mixed type: the variables may be nominal, ordinal and/or quantitative. The
model contains the well-known models of latent class analysis as submodels. As in latent class
analysis, local independence of the variables is assumed. The parameters of the model are estimated by
the EM algorithm. Test statistics and goodness-of-fit measures are proposed for model selection.
Two artificial data sets show the usefulness of these tests. An empirical example completes the
Key words: cluster analysis, latent class analysis, probabilistic clustering, variables of mixed type.
K-means clustering models are widely used for partitioning large databases into
homogeneous clusters (Jain and Dubes, 1988: 90). However, k-means clustering
has certain disadvantages: (1) The variables must be commensurable (Fox, 1982).
This implies interval or ratio variables with equal scale units. (2) Each pattern
(case) is assigned deterministically to one and only one cluster. This may result
in biased estimators of the cluster means if the clusters overlap. (3) There is no
accepted statistical basis, even though many approaches are now available (Bock,
1989; Bryant, 1991; Jahnke, 1988; Pollard, 1981, 1982).
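Disadvantage (2) can be illustrated with a small simulation. The sketch below is not part of the model developed in this paper. It assumes two one-dimensional normal clusters with unit variance, equal sizes and known means (0 and 2, chosen purely for illustration), and all function names are hypothetical. It compares the mean of the first cluster under deterministic nearest-mean assignment with a mean weighted by each pattern's posterior membership probability.

```python
import math
import random


def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simulate(n=20000, means=(0.0, 2.0), seed=1):
    """Draw n patterns from two equally likely, overlapping normal clusters.

    The means, unit variance and equal mixing proportions are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    return [rng.gauss(means[rng.randrange(2)], 1.0) for _ in range(n)]


def cluster0_means(xs, means=(0.0, 2.0)):
    """Estimate the mean of the first cluster in two ways.

    hard: each pattern is assigned to the nearest cluster mean only
          (deterministic assignment, as in k-means).
    soft: each pattern contributes with its posterior probability of
          belonging to the first cluster (probabilistic assignment).
    """
    hard = [x for x in xs if abs(x - means[0]) <= abs(x - means[1])]
    hard_mean = sum(hard) / len(hard)

    weights = []
    for x in xs:
        f0 = normal_pdf(x, means[0])
        f1 = normal_pdf(x, means[1])
        weights.append(f0 / (f0 + f1))  # equal priors cancel out
    soft_mean = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return hard_mean, soft_mean


if __name__ == "__main__":
    hard_mean, soft_mean = cluster0_means(simulate())
    print("deterministic assignment:", round(hard_mean, 3))
    print("probabilistic assignment:", round(soft_mean, 3))
```

Under these assumptions, the deterministic estimate is pulled away from the true mean of 0, because the part of the first cluster that overlaps the second is cut off, whereas the probability-weighted estimate stays close to 0.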
This paper develops a general probabilistic clustering model that overcomes
the problems of k-means clustering. Variables with different measurement levels
– nominal, ordinal and/or interval or ratio (= quantitative variables) – and different
scale units can be analyzed without any transformation of the variables. Each
pattern is assigned probabilistically to the clusters. Hence, the model allows
overlapping or fuzzy clustering. Finally, the model has a statistical basis, the maximum
1. The Model
The main idea of the model is to use probabilities π(k/g) instead of distances d
as is the case in k-means clustering. π(k/g) is the probability that pattern g belongs