ISSN 0032-9460, Problems of Information Transmission, 2012, Vol. 48, No. 1, pp. 72–84.
© Pleiades Publishing, Inc., 2012.
Original Russian Text © A.V. Kolnogorov, 2012, published in Problemy Peredachi Informatsii, 2012, Vol. 48, No. 1, pp. 83–95.
Two-Armed Bandit Problem
for Parallel Data Processing Systems
A. V. Kolnogorov
Applied Mathematics and Information Science Chair,
Yaroslav-the-Wise Novgorod State University
Received March 22, 2011; in final form, September 19, 2011
Abstract—We consider the application of the two-armed bandit problem to processing a large
number N of data items when two alternative processing methods are available. We propose a
strategy that compares the methods at the first stages, whose number is at most r − 1, and at
the final stage applies only the better method as determined by the comparison. We find asymptotically
optimal parameters of the strategy and observe that the minimax risk is of the order of $N^{\alpha}$,
where $\alpha = 2^{r-1}/(2^r - 1)$. Under parallel processing, the total operation time is determined by
the number r of stages rather than by the number N of data items.
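As a quick check of the exponent $\alpha = 2^{r-1}/(2^r - 1)$ stated above, the following Python sketch (the function name `minimax_exponent` is ours, not the paper's) evaluates it exactly with rational arithmetic. Since $\alpha - 1/2 = 1/(2(2^r - 1))$, the exponent equals 1 for r = 1 (no comparison stage), 2/3 for r = 2, 4/7 for r = 3, and decreases toward 1/2 as the number of stages grows.

```python
from fractions import Fraction


def minimax_exponent(r: int) -> Fraction:
    """Exponent alpha = 2**(r - 1) / (2**r - 1) in the risk order N**alpha (r = number of stages)."""
    if r < 1:
        raise ValueError("the number of stages r must be at least 1")
    return Fraction(2 ** (r - 1), 2 ** r - 1)


for r in range(1, 7):
    alpha = minimax_exponent(r)
    # Exact value, its rounded decimal form, and the gap alpha - 1/2 = 1 / (2 * (2**r - 1)).
    print(r, alpha, round(float(alpha), 4), alpha - Fraction(1, 2))
```

Exact fractions make the comparison with the limiting value 1/2 free of rounding error.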
Assume that there are two alternative methods for information transmission (for instance, using
different error-protection codes), numbered $\ell = 1, 2$. The number N of transmitted packets is
assumed to be large. Applying the $\ell$th method is successful (the packet is transmitted without errors)
with probability $p_\ell$ and unsuccessful (the packet is transmitted with errors) with probability
$q_\ell = 1 - p_\ell$, independently of the results for other packets. The probabilities $p_\ell$,
$\ell = 1, 2$, are assumed to be unknown. Thus, information transmission is described by a controlled
random process $\xi_n$, $n = 1, \dots, N$, where $\xi_n = 1$ if packet n was transmitted successfully
and $\xi_n = 0$ if there were transmission errors. The values of the process are often interpreted
as the current gain.
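To make the model concrete, here is a minimal Python simulation sketch of the controlled process $\xi_n$. All names (`run_process`, `win_stay_lose_shift`) and the probabilities 0.9 and 0.8 are hypothetical and chosen only for illustration; the rule that picks the method is a simple win-stay, lose-shift heuristic, not the strategy proposed in the paper. The point is that the method used for packet n may depend on the earlier values $\xi_1, \dots, \xi_{n-1}$, while each $\xi_n$ is a Bernoulli variable with the (unknown) success probability of the chosen method.

```python
import random
from typing import Callable, Dict, List, Tuple

# History of the process: one (method, xi) pair per packet already sent.
History = List[Tuple[int, int]]
Policy = Callable[[History], int]


def run_process(policy: Policy, p: Dict[int, float], n_packets: int,
                rng: random.Random) -> History:
    """Simulate xi_1, ..., xi_N when each method is picked by `policy` from the history."""
    history: History = []
    for _ in range(n_packets):
        method = policy(history)
        # xi_n = 1 (error-free packet) with probability p[method], otherwise 0,
        # independently of all earlier packets.
        xi = 1 if rng.random() < p[method] else 0
        history.append((method, xi))
    return history


def win_stay_lose_shift(history: History) -> int:
    """Hypothetical example rule: keep the method after a success, switch after an error."""
    if not history:
        return 1
    method, xi = history[-1]
    return method if xi == 1 else 3 - method


rng = random.Random(2012)
p = {1: 0.9, 2: 0.8}  # hypothetical probabilities; the policy never sees them
history = run_process(win_stay_lose_shift, p, 1000, rng)
print(sum(xi for _, xi in history), "successful packets out of", len(history))
```

Passing a seeded `random.Random` instance keeps runs reproducible and lets different strategies be compared on the same random stream.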
The goal is to maximize the mathematical expectation $\mathbf{E}\bigl(\sum_{n=1}^{N} \xi_n\bigr)$ of the number
of successfully transmitted packets by means of a strategy that uses the information about the process
accumulated so far. This is the problem of rational behavior in a random environment [1–4], also known
as the adaptive control problem (adaptive choice problem) [5, 6] or the two-armed bandit problem [7, 8].
Approaches to its solution vary with the intended application. For example, models based on finite
automata are used to describe the rational behavior of protozoa in a random environment [1–4]. Larger
gains can be obtained with variable-structure automata, identification algorithms, and recursive
algorithms. Under a Bayesian approach [7, 8], an optimal strategy can be found exactly, but the choice
of a prior distribution must be justified. If there are no criteria for choosing a prior
distribution, one can use the minimax approach considered below.
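The minimax idea can be illustrated on strategies that fix the sequence of methods in advance. The Python sketch below assumes the usual definition of the loss as the shortfall $N\max(p_1, p_2) - \mathbf{E}\bigl(\sum_{n=1}^{N}\xi_n\bigr)$ relative to always using the better method, and approximates the worst case over the unknown $(p_1, p_2)$ by a hypothetical grid; the function names are ours. By linearity of expectation, a schedule that uses method $\ell$ for $n_\ell$ packets gains $n_1 p_1 + n_2 p_2$ on average, so its worst-case loss is $\max(n_1, n_2) \ge N/2$: without using current information, the loss is of order N.

```python
from itertools import product
from typing import Sequence


def expected_successes(schedule: Sequence[int], p1: float, p2: float) -> float:
    """E(xi_1 + ... + xi_N) for a schedule of methods fixed in advance.

    By linearity of expectation, each packet sent by method l contributes p_l.
    """
    p = {1: p1, 2: p2}
    return sum(p[method] for method in schedule)


def loss(schedule: Sequence[int], p1: float, p2: float) -> float:
    # Assumed (standard) loss: shortfall against always using the better method.
    return len(schedule) * max(p1, p2) - expected_successes(schedule, p1, p2)


def worst_case_loss(schedule: Sequence[int], grid: Sequence[float]) -> float:
    # Minimax criterion: judge a strategy by its largest loss over the unknown parameters.
    return max(loss(schedule, p1, p2) for p1, p2 in product(grid, repeat=2))


N = 100
grid = [k / 10 for k in range(11)]  # hypothetical parameter grid 0.0, 0.1, ..., 1.0
always_first = [1] * N
alternate = [1, 2] * (N // 2)
print(worst_case_loss(always_first, grid), worst_case_loss(alternate, grid))
```

Here the worst case is attained at $(p_1, p_2) = (0, 1)$ or $(1, 0)$, which is why strategies that compare the methods and then switch are needed.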
In the above-mentioned settings, sequential data processing is assumed. Note two peculiar features
of our setting: first, the information transmission method cannot be switched too often, since such a
system would operate slowly; second, operation can be sped up by parallel processing. An
elementary example is as follows: at the beginning of the work, both methods are applied to sufficiently
large groups of $\nu$ packets. Then the results are compared, and the more efficient method (with