$Q$-learning in a stochastic Stackelberg game between an uninformed leader and a naive follower
D. B. Rokhlin, Institute of Mathematics, Mechanics and Computer Sciences, Southern Federal University
Abstract:
We consider a game between a leader and a follower, where the players' actions affect the stochastic dynamics of the state process $x_t$, $t\in\mathbb Z_+$. The players observe their own rewards and the system state $x_t$; the transition kernel of the process $x_t$ and the opponent's rewards are unobservable. At each stage of the game, the leader selects an action $a_t$ first. When selecting the action $b_t$, the follower knows the action $a_t$. The follower's actions are unknown to the leader (an uninformed leader). Each player tries to maximize the discounted criterion by applying a $Q$-learning algorithm. The players' randomized strategies are uniquely determined by Boltzmann distributions depending on the $Q$-functions, which are updated in the course of learning. The specific feature of the algorithm is that, when updating the $Q$-function, the follower believes that the action of the leader in the next state will be the same as in the current one (a naive follower). It is shown that the convergence of the algorithm is ensured by the existence of deterministic stationary strategies that generate an irreducible Markov chain. The limiting large-time behavior of the players' $Q$-functions is described in terms of controlled Markov processes. The distributions of the players' actions converge to Boltzmann distributions depending on the limiting $Q$-functions.
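For illustration, a minimal sketch of one stage of the described scheme might look as follows; the state and action space sizes, learning rate, temperature, and the reward and transition callbacks are hypothetical placeholders, and the paper's exact update rules may differ in detail.

```python
# Illustrative sketch (not the paper's exact algorithm) of one stage of Q-learning
# with Boltzmann exploration, an uninformed leader, and a naive follower.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_a, n_b = 5, 3, 3          # hypothetical sizes of the state and action sets
gamma, alpha, tau = 0.9, 0.1, 1.0     # discount factor, learning rate, temperature

Q_leader = np.zeros((n_states, n_a))          # leader's Q(x, a): follower's action is unobserved
Q_follower = np.zeros((n_states, n_a, n_b))   # follower's Q(x, a, b): leader's action is observed

def boltzmann(q_values, tau):
    """Boltzmann (softmax) distribution over actions given Q-values."""
    z = np.exp((q_values - q_values.max()) / tau)
    return z / z.sum()

def step(x, r_leader, r_follower, transition):
    """One stage of the game; r_* and transition are hypothetical environment callbacks."""
    # Leader selects a_t first from its Boltzmann policy.
    a = rng.choice(n_a, p=boltzmann(Q_leader[x], tau))
    # Follower observes a_t and then selects b_t from its Boltzmann policy.
    b = rng.choice(n_b, p=boltzmann(Q_follower[x, a], tau))
    x_next = transition(x, a, b)
    # Leader's update: ordinary Q-learning over its own actions only.
    Q_leader[x, a] += alpha * (r_leader(x, a, b)
                               + gamma * Q_leader[x_next].max()
                               - Q_leader[x, a])
    # Naive follower: assumes the leader plays the same action a in the next state.
    Q_follower[x, a, b] += alpha * (r_follower(x, a, b)
                                    + gamma * Q_follower[x_next, a].max()
                                    - Q_follower[x, a, b])
    return x_next
```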
Keywords: $Q$-learning, leader, follower, stochastic Stackelberg game, discounted criterion, Boltzmann distribution.
Received: 18.06.2018
Revised: 12.10.2018
Accepted: 18.10.2018
DOI:
10.4213/tvp5231