
Dokl. RAN. Math. Inf. Proc. Upr., 2024 Volume 520, Number 2, Pages 260–266 (Mi danma605)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

MDS-ViTNet: Improving Saliency Prediction for Eye-Tracking with Vision Transformer

I. Polezhaev^{a,b}, I. Goncharenko^{b,c}, N. Yurina^{c}

a Yandex, Moscow, Russia
b Moscow Institute of Physics and Technology, Dolgoprudny, Moscow oblast, Russia
c Sber, Moscow, Russia

Abstract: In this paper, we present a novel methodology, MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network), for enhancing visual saliency prediction for eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure, with the encoder utilizing a Swin Transformer to efficiently embed the most important features. This process involves transfer learning, wherein layers from the Vision Transformer are adapted by the encoder and seamlessly integrated into a CNN decoder, ensuring minimal information loss from the original input image. The decoder employs a multi-decoding technique: two separate decoders generate two distinct attention maps, which are subsequently combined into a single output by an additional CNN model. Our trained MDS-ViTNet model achieves state-of-the-art results across several benchmarks. Committed to fostering further collaboration, we intend to make our code, models, and datasets publicly available.
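The fusion step described in the abstract, where two decoder attention maps are merged into one saliency map, can be illustrated with a minimal sketch. This is not the authors' fusion CNN: the function `fuse_saliency_maps`, its weights, and the 1x1-convolution-style weighted sum below are hypothetical stand-ins used only to show the idea of combining dual decoder outputs into a single map.

```python
import numpy as np

def sigmoid(x):
    # Squash logits into the (0, 1) range expected of a saliency map
    return 1.0 / (1.0 + np.exp(-x))

def fuse_saliency_maps(map_a, map_b, w=(0.5, 0.5), bias=0.0):
    """Combine two decoder attention maps into one saliency map.

    A hypothetical stand-in for the paper's fusion CNN: a
    1x1-convolution-style weighted sum of the two maps, followed
    by a sigmoid. In the real model the weights would be learned.
    """
    assert map_a.shape == map_b.shape, "decoder outputs must align"
    fused = w[0] * map_a + w[1] * map_b + bias
    return sigmoid(fused)

# Two toy 4x4 "decoder outputs" (pre-activation logits)
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
b = rng.normal(size=(4, 4))

saliency = fuse_saliency_maps(a, b)
assert saliency.shape == (4, 4)
assert (saliency > 0).all() and (saliency < 1).all()
```

In the actual architecture the combining module is a small CNN rather than a fixed weighted sum, so it can learn spatially varying preferences between the two decoders' predictions.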

UDC: 004.8

Received: 27.09.2024
Accepted: 02.10.2024

DOI: 10.31857/S2686954324700620


English version:
Doklady Mathematics, 2024, 110:suppl. 1, S230–S235


© Steklov Math. Inst. of RAS, 2025