Problems of Information Transmission, Vol. 37, No. 2, 2001, pp. 172–184. Translated from Problemy Peredachi Informatsii, No. 2, 2001, pp. 96–109.
Original Russian Text Copyright © 2001 by Kukushkina, Polikarpov, Khmelev.
Using Literal and Grammatical Statistics
for Authorship Attribution
O. V. Kukushkina, A. A. Polikarpov, and D. V. Khmelev
Received August 8, 2000; in final form, January 11, 2001
Abstract—Markov chains are used as a formal mathematical model for sequences of elements
of a text. This model is applied to authorship attribution of texts. As elements of a text, we
consider sequences of letters or sequences of grammatical classes of words. It turns out that
the frequencies of occurrence of letter pairs and of pairs of grammatical classes in a Russian text
are rather stable characteristics of an author and, apparently, could be used in disputed
authorship attribution. Results are compared for several modifications of the method using
both letters and grammatical classes. The experiments involve 385 texts by 82
writers. The Appendix describes research by D.V. Khmelev in which data compression
algorithms are applied to authorship attribution.
In this paper, the problem of identification of the author of a text is stated as follows. Let large
fragments of prose works by a number of authors be given. These texts are in Russian or in another
phonological (nonhieroglyphic) language.¹ An anonymous text known to belong to one of these
authors is disputed between all of them. One has to determine the actual author. Based on results
of testing the technique suggested by D.V. Khmelev in , we claim that this can be done with
sufficiently high probability. This technique is based on statistics of occurrences of pairs of successive
elements in the text (letters, morphemes, etc.).
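The pair-statistics idea stated above can be illustrated by a minimal sketch. It assumes a first-order Markov chain over letters, additive smoothing of the transition estimates, and a maximum-likelihood decision rule. None of these details is fixed by this section, and all function names (`normalize`, `pair_counts`, `transition_log_probs`, `attribute`) are hypothetical, so this is one plausible reading rather than the technique as tested by the authors.

```python
import math
from collections import Counter


def normalize(text):
    """Keep letters and spaces only, lowercased (a simplifying assumption)."""
    return [ch for ch in text.lower() if ch.isalpha() or ch == " "]


def pair_counts(symbols):
    """Count each pair of successive symbols in the sequence."""
    return Counter(zip(symbols, symbols[1:]))


def transition_log_probs(symbols, alphabet, alpha=1.0):
    """Estimate log P(b | a) from pair counts; alpha is an assumed smoothing constant."""
    counts = pair_counts(symbols)
    row_totals = Counter()
    for (a, _), n in counts.items():
        row_totals[a] += n
    extra = alpha * len(alphabet)
    return {
        (a, b): math.log((counts[(a, b)] + alpha) / (row_totals[a] + extra))
        for a in alphabet
        for b in alphabet
    }


def attribute(known_texts, anonymous_text):
    """Return the author whose chain gives the anonymous text the highest
    log-likelihood, together with the score of every candidate."""
    anon = normalize(anonymous_text)
    known = {author: normalize(text) for author, text in known_texts.items()}
    alphabet = sorted(set(anon).union(*known.values()))
    anon_pairs = pair_counts(anon)
    scores = {}
    for author, symbols in known.items():
        log_p = transition_log_probs(symbols, alphabet)
        scores[author] = sum(n * log_p[pair] for pair, n in anon_pairs.items())
    best = max(scores, key=scores.get)
    return best, scores


# Toy data, not taken from the paper.
known = {"A": "abababababab abab", "B": "aabbaabbaabb aabb"}
best, scores = attribute(known, "ababab")
print(best, scores)
```

Since `pair_counts` accepts any sequence, the same sketch would apply to sequences of grammatical class tags in place of letters, provided a tagger (not shown here) supplies them.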
To the best of our knowledge, the first attempts to find a technique for authorship attribution
were made in . Markov  almost immediately replied to , which shows that the founder
of the theory of Markov chains was quite interested in this field. Note also that the first application
of “events linked to a chain” was described by Markov in , where he studied the distribution of
vowels and consonants among the initial 20 000 letters of “Evgenii Onegin.”
Modern methods of authorship attribution in Russia are reviewed in [5, Chapter 1] and a nice
review of foreign works is given in . Despite the enormous variety of methods described, none of
them has ever been applied to a large number of texts. The reason is that usually these methods
cannot be automated and require human intervention, which makes computational analysis of a large
number of large texts practically impossible. Hence, the question of generality of each of these
methods arises: Can any of them be used outside the situation it was devised for?
Until recently, the only exception was , where the chosen technique was applied to a sufficiently
large number of texts. In that work, the rate of function words used by an author was examined.
It was found that this rate is stable for each author among a large number of Russian writers of the
eighteenth–twentieth centuries. This technique was applied in  to the problem of determining
¹ We consider phonological writing only, because hieroglyphic writing reduces the possibilities for the analysis
of pair associations: phonological information (which distinguishes morphemes and words) is hidden
by the conventional hieroglyphic representation of these units.
© 2001 MAIK “Nauka/Interperiodica”