SEMINARS
How do multimodal large language models see color?
G. R. Lobarev, Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Moscow
Abstract: Modern multimodal models such as Qwen-VL, LLaVA, or GPT combine language and vision to "understand" the world in a way closer to how humans do. But to what extent is this understanding genuinely perceptual, especially in an area as subtle as color? For a person, color is not an RGB code but a more abstract percept that depends on context, lighting, and even emotion. At the seminar we will discuss how the color space inside an MLLM is organized and compare it with the human psychophysical space: do the visual encoders (ViTs) extract sufficiently accurate color representations? And, most importantly, does the language block (LLM) apply a correction that brings the model's perception closer to the human one? We will present the results of an analysis of Qwen-VL embeddings against classical psychophysical data (the Munsell scale) and ask whether language really helps the model "see".
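As a rough illustration of the kind of comparison described above (not the speaker's actual pipeline), one common approach is representational similarity analysis: compute pairwise dissimilarities between color chips in the model's embedding space and in the human psychophysical (Munsell) space, then correlate the two structures. The sketch below assumes that `model_embeddings` (one vector per chip, e.g. taken from a ViT layer or a later LLM layer of Qwen-VL) and `munsell_coords` (perceptual coordinates of the same chips) are already available; both are filled with random placeholders here.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_chips = 320                                         # e.g. a standard Munsell chip array
model_embeddings = rng.normal(size=(n_chips, 768))    # placeholder for real model embeddings
munsell_coords = rng.normal(size=(n_chips, 3))        # placeholder for hue/value/chroma coordinates

# Pairwise dissimilarities between color chips in each space.
model_dists = pdist(model_embeddings, metric="cosine")
human_dists = pdist(munsell_coords, metric="euclidean")

# Rank correlation between the two dissimilarity structures:
# the higher it is, the closer the model's color geometry is to the human one.
rho, p_value = spearmanr(model_dists, human_dists)
print(f"RSA (Spearman rho): {rho:.3f} (p = {p_value:.2g})")
```

Running the same comparison separately on visual-encoder embeddings and on embeddings after the language block would show whether the LLM stage moves the model's color geometry toward or away from the human one.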