Abstract:
We consider the multiarmed bandit problem (the problem of Markov bandits)
with switching penalties and no discounting in case when state spaces of all bandits are finite.
An optimal strategy should have the largest average reward per unit time on
an infinite time horizon.
For this problem it is shown that an optimal strategy can be specified by a Gittins index
under the natural assumption that the switching penalties are nonnegative.
Keywords:multicomponent systems, Gittins index,
simple family of alternative Markov bandit processes,
multiarmed bandit problem, Markov decision process, controlled Markov processes,
long run average return, no discounting, switching penalties,
optimal strategy.