
Zh. Vychisl. Mat. Mat. Fiz., 2025, Volume 65, Number 3, Pages 325–337 (Mi zvmmf11939)

Optimal Control

Optimal approximation of average reward Markov decision processes

Yu. F. Sapronov^{a,b,c}, N. E. Yudin^{a,b,c,d,e,f}

a Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Russia
b Higher School of Economics University, 109028, Moscow, Russia
c Innopolis University, 420500, Innopolis, Russia
d Institute for Information Transmission Problems (Kharkevich Institute), Russian Academy of Sciences, 127051, Moscow, Russia
e Ivannikov Institute for System Programming, Russian Academy of Sciences, 109004, Moscow, Russia
f Federal Research Center "Computer Science and Control" of Russian Academy of Sciences, 119333, Moscow, Russia

Abstract: We further develop the approach of finding an $\varepsilon$-optimal policy for Average Reward Markov Decision Processes (AMDPs) by reducing the problem to Discounted Markov Decision Processes (DMDPs). Existing research typically requires the discount factor to stay above a certain threshold, which is usually close to one; as is well known, iterative methods for finding the optimal DMDP policy become less effective as the discount factor approaches one. Our work differs from existing studies in that it allows the empirical Bellman equation to be solved inexactly, while still matching the sample complexity of the latest results. In the upper bound, we separate the contribution of the error in approximating the transition matrix from that of the residual in solving the Bellman equation, so our results make it possible to determine the total complexity of finding an $\varepsilon$-optimal policy for a DMDP with any method that has a theoretical guarantee on its iteration complexity.
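The reduction described above ultimately requires solving the Bellman optimality equation of a DMDP iteratively, and the abstract's remark about the discount factor reflects the cost: the Bellman operator is a $\gamma$-contraction, so the iteration count scales roughly as $\log(1/\varepsilon)/(1-\gamma)$ and blows up as $\gamma \to 1$. Below is a minimal illustrative sketch of plain value iteration on a toy DMDP; the instance, function name, and stopping rule are our assumptions for illustration, not the authors' method.

import numpy as np

def value_iteration(P, r, gamma, eps):
    """Solve the Bellman optimality equation V = max_a (r_a + gamma * P_a V)
    to eps-accuracy in the sup-norm.

    P: transition kernel, shape (A, S, S); r: rewards, shape (A, S).
    Returns an approximate optimal value function and a greedy policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    # The Bellman backup is a gamma-contraction, so about
    # log(1/eps) / (1 - gamma) iterations are needed -- this is why
    # iterative methods degrade as gamma approaches one.
    while True:
        Q = r + gamma * P @ V      # Q[a, s] = r[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        V_new = Q.max(axis=0)      # Bellman optimality backup
        # Standard stopping rule guaranteeing an eps-optimal greedy policy.
        if np.max(np.abs(V_new - V)) < eps * (1 - gamma) / (2 * gamma):
            return V_new, Q.argmax(axis=0)
        V = V_new

# Toy 2-state, 2-action instance (illustrative assumption).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # row-stochastic transitions
r = rng.uniform(size=(2, 2))
V, pi = value_iteration(P, r, gamma=0.9, eps=1e-6)
print("V* ~", V, "greedy policy:", pi)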

Keywords: Markov decision processes, reinforcement learning, sample complexity, iteration complexity, ICOMP2024.

UDC: 519.87

Received: 06.11.2024
Revised: 06.11.2024
Accepted: 13.12.2024

Publication language: English

DOI: 10.31857/S0044466925030074


English version: Computational Mathematics and Mathematical Physics, 2025, 65:3, 567–581


© Steklov Mathematical Institute of RAS, 2025