Abstract:
This article reviews and compares multimodal virtual environments for reinforcement learning. Seven environments are considered: HomeGrid, BabyAI, RTFM, Messenger, Touchdown, Alfred, and IGLU. The study focuses on their distinctive features and the requirements they place on agents. Particular attention is paid to the complexity of the text instructions and the dynamic properties of each environment. The analysis identifies the strengths and weaknesses of each environment, which makes it possible to determine the optimal conditions for effective agent training. It also highlights the need for more balanced environments that place high demands on both language understanding and interaction with the surroundings.
Keywords: multimodal learning, language grounding, reinforcement learning.