
Dokl. RAN. Math. Inf. Proc. Upr., 2024 Volume 520, Number 2, Pages 260–266 (Mi danma605)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

MDS-ViTNet: Improving Saliency Prediction for Eye-Tracking with Vision Transformer

I. Polezhaev^{a,b}, I. Goncharenko^{b,c}, N. Yurina^{c}

a Yandex, Moscow, Russia
b Moscow Institute of Physics and Technology, Dolgoprudny, Moscow oblast, Russia
c Sber, Moscow, Russia

Abstract: In this paper, we present a novel methodology, MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network), for enhancing visual saliency prediction for eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure, with the encoder utilizing a Swin Transformer to efficiently embed the most important features. This process involves transfer learning, wherein layers from the Vision Transformer are adapted by the encoder and seamlessly integrated into a CNN decoder, ensuring minimal information loss from the original input image. The decoder employs a multi-decoding technique: two separate decoders generate two distinct attention maps, which are subsequently combined into a single output by an additional CNN model. Our trained MDS-ViTNet model achieves state-of-the-art results across several benchmarks. Committed to fostering further collaboration, we intend to make our code, models, and datasets publicly available.
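The fusion step described in the abstract, where two decoder attention maps are merged into one saliency map, can be illustrated with a minimal sketch. This is not the authors' fusion CNN: the function `fuse_saliency_maps`, its weights, and the 1x1-convolution-style weighted sum below are hypothetical stand-ins used only to show the idea of combining dual decoder outputs into a single map.

```python
import numpy as np

def sigmoid(x):
    # Squash logits into the (0, 1) range expected of a saliency map
    return 1.0 / (1.0 + np.exp(-x))

def fuse_saliency_maps(map_a, map_b, w=(0.5, 0.5), bias=0.0):
    """Combine two decoder attention maps into one saliency map.

    A hypothetical stand-in for the paper's fusion CNN: a
    1x1-convolution-style weighted sum of the two maps, followed
    by a sigmoid. In the real model the weights would be learned.
    """
    assert map_a.shape == map_b.shape, "decoder outputs must align"
    fused = w[0] * map_a + w[1] * map_b + bias
    return sigmoid(fused)

# Two toy 4x4 "decoder outputs" (pre-activation logits)
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
b = rng.normal(size=(4, 4))

saliency = fuse_saliency_maps(a, b)
assert saliency.shape == (4, 4)
assert (saliency > 0).all() and (saliency < 1).all()
```

In the actual architecture the combining module is a small CNN rather than a fixed weighted sum, so it can learn spatially varying preferences between the two decoders' predictions.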

UDC: 004.8

Received: 27.09.2024
Accepted: 02.10.2024

DOI: 10.31857/S2686954324700620


English version:
Doklady Mathematics, 2024, 110:suppl. 1, S230–S235


© Steklov Math. Inst. of RAS, 2025