ISSN 0032-9460, Problems of Information Transmission, 2011, Vol. 47, No. 1, pp. 28–33.
© Pleiades Publishing, Inc., 2011.
Original Russian Text © M.A. Babenko, T.A. Starikovskaya, 2011, published in Problemy Peredachi Informatsii, 2011, Vol. 47, No. 1, pp. 33–39.
Computing the Longest Common Substring with One Mismatch
M. A. Babenko and T. A. Starikovskaya
Chair of Mathematical Logic and Theory of Algorithms, Faculty of Mechanics and Mathematics,
Lomonosov Moscow State University
Received May 7, 2010; in final form, August 24, 2010
Abstract. The paper describes an algorithm for computing longest common substrings of two
strings α₁ and α₂ with one mismatch in O(|α₁||α₂|) time and O(|α₁|) additional space. The
algorithm always scans symbols of α₂ sequentially, starting from the first symbol. The RAM
model of computation is used.
Computing the longest common substring is one of the basic problems in stringology. However,
for applications such as bioinformatics and text analysis, the problem of computing longest common
substrings with mismatches (insertions, deletions, and substitutions of symbols) is of much greater
importance. There are several approaches to this problem, for instance, dynamic programming.
A solution based on dynamic programming computes longest common approximate substrings using
time and space proportional to the product of the strings’ lengths.
The algorithm described in this paper has the same running time and computes the longest
common substrings of two strings with one mismatch. The additional space used by the algorithm is
proportional to the length of the shorter string.
The most general form of the problem in question is as follows.
Given two strings α₁ and α₂, compute substrings γ₁ and γ₂ of α₁ and α₂, respectively, that
differ in at most one symbol and have maximum length.
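To make the problem statement concrete, the following brute-force sketch checks every pair of starting positions directly against the definition. It is not the algorithm of the paper and is far slower; it serves only as a reference for small inputs. The function names and the 0-based return convention are assumptions made for illustration.

```python
from typing import Tuple


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(x) == len(y)
    return sum(1 for p, q in zip(x, y) if p != q)


def lcs_one_mismatch_bruteforce(a1: str, a2: str) -> Tuple[int, int, int]:
    """Return (length, i, j) such that a1[i:i+length] and a2[j:j+length]
    differ in at most one symbol and length is maximum (0-based starts)."""
    best = (0, 0, 0)
    for i in range(len(a1)):
        for j in range(len(a2)):
            max_len = min(len(a1) - i, len(a2) - j)
            # Only lengths that would improve on the current best are tried.
            for length in range(best[0] + 1, max_len + 1):
                if hamming(a1[i:i + length], a2[j:j + length]) > 1:
                    break  # longer windows at (i, j) only add mismatches
                best = (length, i, j)
    return best
```

For example, for "abcde" and "xbcye" the sketch reports the substrings "bcde" and "bcye", which differ only in the third symbol.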
Let us assume that the length of α₂ is considerably larger than that of α₁. The considered
algorithm makes several passes over α₂, but each time reads the string sequentially from left to
right, starting from the first symbol. This condition is essential for applications that store α₂
in external memory, since random access becomes time-expensive in this case.
Here |α₁| is denoted by n₁ and |α₂| is denoted by n₂. The running time of the
presented algorithm is O(n₁n₂), and the additional memory used by the algorithm is O(|α₁|).
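The resource bounds above can be illustrated with a textbook-style dynamic program. The sketch below is not the authors' algorithm, only a simple baseline under stated assumptions. It keeps the shorter string in memory, reads the longer string once as a stream of symbols, and maintains two arrays of length n₁ + 1. The names `lcs_one_mismatch_stream`, `exact`, and `approx` are hypothetical.

```python
from typing import Iterable, Tuple


def lcs_one_mismatch_stream(a1: str, a2_stream: Iterable[str]) -> Tuple[int, int, int]:
    """Return (length, end1, end2): a best pair of substrings ends at the
    1-based position end1 in a1 and end2 in the stream.

    After reading j symbols of the stream:
      exact[i]  = length of the longest common suffix of a1[1..i] and the
                  first j stream symbols, with no mismatch;
      approx[i] = the same, but allowing at most one mismatch.
    """
    n1 = len(a1)
    exact = [0] * (n1 + 1)
    approx = [0] * (n1 + 1)
    best = (0, 0, 0)
    for j, symbol in enumerate(a2_stream, start=1):
        # Go right to left so that index i - 1 still holds the values
        # computed for stream position j - 1.
        for i in range(n1, 0, -1):
            if a1[i - 1] == symbol:
                exact[i] = exact[i - 1] + 1
                approx[i] = approx[i - 1] + 1
            else:
                approx[i] = exact[i - 1] + 1  # the single mismatch is spent here
                exact[i] = 0
            if approx[i] > best[0]:
                best = (approx[i], i, j)
    return best
```

Each stream symbol costs O(n₁) work, so the total time is O(n₁n₂), and apart from α₁ itself only the two arrays are stored. Passing `iter("xbcye")` or a generator that reads a file symbol by symbol works equally well, since the stream is never revisited.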
Preliminaries. We assume that a finite nonempty set (alphabet) Σ is fixed. Elements of this
set are called letters or symbols. A finite ordered sequence of letters (possibly empty) is called a
string. We use Greek letters to denote strings. Letters in a string are numbered starting from 1; for
example, the letters of α (which is of length k) are denoted by α[1], ..., α[k]. The length k of α is
denoted by |α|. The substring of α from position i to position j (inclusive) is denoted by α[i : j].
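Since most programming languages index strings from 0 and use half-open ranges, the 1-based inclusive notation α[i : j] needs a small translation. The helper below is a hypothetical illustration; in particular, rejecting ranges with i > j is an assumption, not something the text specifies.

```python
def substring(alpha: str, i: int, j: int) -> str:
    """Return alpha[i : j] in the 1-based, inclusive notation of the text."""
    if not (1 <= i <= j <= len(alpha)):
        raise ValueError("expected 1 <= i <= j <= |alpha|")
    # Positions i..j (1-based, inclusive) map to the Python slice [i-1, j).
    return alpha[i - 1:j]
```

For instance, `substring("stringology", 1, 6)` yields "string", and `substring(alpha, i, i)` yields the single letter α[i].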
Supported in part by the Russian Foundation for Basic Research, project no. 09-01-00709-a.