Problems of Information Transmission, Vol. 40, No. 2, 2004, pp. 168–174. Translated from Problemy Peredachi Informatsii, No. 2, 2004, pp. 73–80.
Original Russian Text Copyright © 2004 by Nekrasov.
Probability-Theoretic and Statistical Analysis
of Dictionary Texts
S. A. Nekrasov
South-Russia State University of Technology, Novocherkassk
Received July 24, 2003; in final form, January 30, 2004
Abstract: We consider probability-theoretic and statistical models and methods for computing
the characteristics of dictionary structures. Results of a statistical analysis of several commonly
used dictionaries are presented in order to test the adequacy of the computing methods.
When dealing with problems of mathematical modeling of text structures, design of various
text processing systems, and in many other cases, one often needs to know the corresponding
probabilistic and statistical characteristics [1–4]. Such characteristics can be obtained either by the
simulation method or with the use of probability-theoretic models. As an adequacy criterion of the
models, various statistics can be applied.
In this paper, we consider the problem of modeling a dictionary structure, which is a sequence
of the form
    CW(1), Entry(1), CW(2), Entry(2), ..., CW(k), Entry(k), ...,
where CW(k) is the kth catchword (CW) of an input language (counted from the beginning of the
dictionary) and Entry(k) is the corresponding dictionary entry (or a dictionary nest), consisting
of a finite sequence of terminal symbols of the input and output languages.
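
As an illustration only, the dictionary structure described above can be sketched as a list of (catchword, entry) pairs, from which the position of each catchword in the text follows by accumulating sizes. The names DictionaryItem and start_positions, the sample entries, and the one-character separator between a catchword and its entry are assumptions made for this sketch and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DictionaryItem:
    """One element of the dictionary structure: CW(k) followed by Entry(k)."""
    catchword: str  # CW(k), a word of the input language
    entry: str      # Entry(k), symbols of the input and output languages

    def size(self) -> int:
        # Assumption: one separator character between catchword and entry.
        return len(self.catchword) + 1 + len(self.entry)


def start_positions(items: List[DictionaryItem]) -> List[int]:
    """Return the character offset at which each catchword begins."""
    positions = []
    offset = 0
    for item in items:
        positions.append(offset)
        offset += item.size()
    return positions


# Hypothetical sample data, for illustration only.
sample = [
    DictionaryItem("cat", "a small domesticated feline"),
    DictionaryItem("dog", "a domesticated canine"),
    DictionaryItem("eel", "a snake-like fish"),
]

if __name__ == "__main__":
    for item, pos in zip(sample, start_positions(sample)):
        print(pos, item.catchword)
```

Because entry sizes vary, the offset of the kth catchword is a sum of k - 1 variable sizes, which is why the positions of entries are treated as random quantities in what follows.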
Deterministic forecasting of the sizes of individual entries in real dictionaries is practically
impossible. For this reason, the main efficient computational methods are probability-theoretic
modeling and statistical simulation. The main result of the analysis is the probability distribution
of the vector of parameters that characterize the positions of entries.
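
A minimal sketch of the statistical-simulation approach, under assumptions that are not taken from the paper: entry lengths are measured in lines and drawn independently from a geometric distribution with a given mean, pages hold a fixed number of lines, and the quantity estimated is the distribution of the line within a page on which an entry starts. The function name simulate_start_lines and all of its parameters are hypothetical.

```python
import random
from collections import Counter
from typing import Dict


def simulate_start_lines(n_entries: int, lines_per_page: int,
                         mean_lines: float, trials: int,
                         seed: int = 0) -> Dict[int, float]:
    """Estimate how often an entry starts on each line of a page.

    Assumption: entry lengths are i.i.d. geometric on {1, 2, ...}
    with mean `mean_lines`; this is an illustrative choice only.
    """
    rng = random.Random(seed)
    counts: Counter = Counter()
    p = 1.0 / mean_lines  # success probability of the geometric law
    for _ in range(trials):
        line = 0  # running line number from the start of the dictionary
        for _ in range(n_entries):
            counts[line % lines_per_page] += 1
            # Draw a geometric entry length (number of lines, at least 1).
            length = 1
            while rng.random() > p:
                length += 1
            line += length
    total = sum(counts.values())
    return {pos: counts[pos] / total for pos in range(lines_per_page)}


if __name__ == "__main__":
    freq = simulate_start_lines(n_entries=200, lines_per_page=10,
                                mean_lines=3.0, trials=500)
    for pos, f in freq.items():
        print(f"line {pos}: {f:.3f}")
```

The empirical frequencies returned here play the role of a simulated distribution law; with real data, the length distribution would instead be estimated from the dictionaries under study.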
Similar problems are solved or formulated in the theory of programming and databases [1, 2].
For example, in classical monograph , for solving a number of problems of analyzing computer
algorithms and information structures (e.g., for the algorithm of dynamic memory allocation),
both a rigorous probability-theoretic approach and simulation by the Monte Carlo method are
used. However, for the case of list structures (which corresponds to the case of dictionary texts
considered in this paper), only the simplest case of fixed-length records is studied in .
In database design theory, one of the principal problems is forecasting the memory requirements
for storing variable-length records in tree structures (or compound lists) .
In this paper, we improve on the results of . In particular, to demonstrate the adequacy of the
suggested model and solution method, we use statistical data for a wider set of dictionaries. We
also investigate the influence of the method of text formatting.
© 2004 MAIK “Nauka/Interperiodica”