
Zap. Nauchn. Sem. POMI, 2024 Volume 540, Pages 178–193 (Mi znsl7550)

An opensource library for AutoML multimodal clustering on Apache Spark

S. Muravyov (a), V. Kazakovtsev (b), I. Usov (a), P. Shpineva (a), O. Muravyova (a), A. Shalyto (a)

(a) ITMO University, St. Petersburg, Russia
(b) Siberian Federal University, Krasnoyarsk, Russia

Abstract: We present a library for choosing and configuring a clustering algorithm for multimodal datasets, i.e., data in which each object is stored not as a single vector but can be represented as a vector, a text, and an image at the same time, with every modality being significant. For the input data, the library automatically balances exploration and exploitation across its set of implemented clustering algorithms, guided by the selected internal clustering validation index. The library also implements a recommender system for the internal validation index, which can predict the best-fitting measure for the input data. The clustering algorithms are implemented on Apache Spark, so the library can be used on distributed computing systems to cluster big multimodal data.

Key words and phrases: automatic machine learning, multimodal models, clustering, Apache Spark.

Received: 15.11.2024

Language: English


