ISSN 0032-9460, Problems of Information Transmission, 2009, Vol. 45, No. 2, pp. 124–144.
Pleiades Publishing, Inc., 2009.
Original Russian Text: A.G. D’yachkov, A.N. Voronina, 2009, published in Problemy Peredachi Informatsii, 2009, Vol. 45, No. 2, pp. 56–77.
DNA Codes for Additive Stem Similarity
A. G. D’yachkov and A. N. Voronina
Probability Theory Chair, Faculty of Mechanics and Mathematics,
Lomonosov Moscow State University
Received September 16, 2008; in final form, March 12, 2009
Abstract—We study two new concepts of combinatorial coding theory: additive stem similar-
ity and additive stem distance between q-ary sequences. For q = 4, the additive stem similarity
is applied to describe a mathematical model of thermodynamic similarity, which reflects the
“hybridization potential” of two DNA sequences. Codes based on the additive stem distance are
called DNA codes. We develop methods to prove upper and lower bounds on the rate of DNA
codes analogous to the well-known Plotkin upper bound and random coding lower bound (the
Gilbert–Varshamov bound). These methods take into account both the “Markovian” character
of the additive stem distance and the structure of a DNA code specified by its invariance under
the Watson–Crick transformation. In particular, our lower bound is established with the help
of an ensemble of random codes where the distribution of independent codewords is defined by a
stationary Markov chain.
For DNA sequences, we use the notation and terminology introduced in . Single DNA strands
(or DNA sequences) are treated as sequences over the alphabet (A, C, G, T), with the four letters
denoting the corresponding nucleic acids (bases). DNA sequences are directed; thus, for instance,
the DNA sequence (AACG) is distinct from (GCAA). The reverse complement (or Watson–Crick)
transformation of a DNA strand is defined by first reversing the order of the letters and then
replacing each letter with its complement, namely, A with T, C with G, and vice versa. For example,
the DNA strands (AACG) and (CGTT) are mutually reverse complement sequences.
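The reverse complement transformation described above can be illustrated with a short Python sketch. The names `COMPLEMENT` and `reverse_complement` are chosen here for illustration and do not come from the paper; strands are assumed to be plain strings over the letters A, C, G, T.

```python
# Complement pairs: A <-> T and C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Reverse the strand, then replace each base with its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


if __name__ == "__main__":
    print(reverse_complement("AACG"))  # CGTT
    # Applying the transformation twice returns the original strand.
    print(reverse_complement(reverse_complement("AACG")))  # AACG
```

The second call shows that the transformation is an involution, which is why two such strands are called mutually reverse complement.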
DNA hybridization is the process of two oppositely directed single DNA strands coalescing
into a duplex by forming hydrogen bonds between complementary bases of the two DNA strands.
A duplex formed of two mutually reverse complement DNA strands is called a Watson–Crick duplex.
Cross hybridization of DNA strands occurs when a duplex is formed of two oppositely directed and
noncomplementary DNA strands. Cross hybridization does not necessarily occur but is possible.
The energy of DNA hybridization, i.e., the sum of energies of all formed hydrogen bonds, can to a first
approximation be measured [1–3] either by the length of the longest common subsequence of
these two sequences or by the “stem” similarity between the DNA strands, which can be calculated as
the number of common 2-blocks (stems) containing adjacent symbols in the same longest common
subsequence. One can easily see that the maximum energy of DNA hybridization is attained
when a Watson–Crick duplex is formed. Papers [2, 4] develop a definition of energy that takes into
account weights of these stems.
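As a rough illustration of the first of these two measures, the length of the longest common subsequence of two sequences can be computed with the standard dynamic programming recursion. This is a minimal sketch, not the authors' procedure: the function name `lcs_length` is hypothetical, and the code covers only the subsequence length, not the stem similarity or its weighted variants.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y.

    Standard row-by-row dynamic programming: prev[j] holds the LCS length
    of the prefix of x processed so far and the first j symbols of y.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Matching symbols extend the best subsequence of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop the last symbol of x or of y, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("AACG", "AACG"))  # 4
    print(lcs_length("AACG", "GCAA"))  # 2
```

A sequence compared with itself attains the maximum possible value, its own length, whereas (AACG) and (GCAA) share only the subsequence (AA).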
In biological experiments engaging the DNA hybridization property, cross hybridization is generally
undesirable, since it usually leads to experimental errors. There arises the problem of constructing
the largest possible ensembles of single DNA strands such that cross hybridization in