Abstract:
Text-to-image models generate images from user-written prompts. Models such as DALL-E 2, Imagen, Stable Diffusion, and Midjourney can produce photorealistic images or images resembling human-drawn art. Beyond imitating human art, large text-to-image models have learned to produce pixel patterns reminiscent of captions in natural languages. For example, a generated image might contain the figure of an animal together with a string of symbols resembling human-readable words that describe the biological name of the species. Although the words that occasionally appear on generated images can look human-readable, they are not rooted in natural-language vocabularies and make no sense to non-linguists. At the same time, we argue that a semiotic and linguistic analysis of this so-called hidden vocabulary of text-to-image models will contribute to explainable AI and prompt engineering. The results of such an analysis can help reduce the risks of applying these models to real-life problem solving and support deepfake detection. The proposed study is one of the first attempts to analyze text-to-image models from the point of view of semiotics and linguistics. Our approach comprises prompt engineering, image generation, and comparative analysis. The source code, generated images, and prompts are available at https://github.com/vifirsanova/text-to-image-explainable.
Key words and phrases: explainable artificial intelligence, text-to-image synthesis, diffusion models.
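
To make the image-generation step of the approach concrete, the following is a minimal sketch of how the prompt-to-image stage could be reproduced, assuming the Hugging Face diffusers Stable Diffusion pipeline; the checkpoint ID, the example prompt, and the output file name are illustrative assumptions, not the authors' method, whose actual code lives in the repository linked above.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a publicly released Stable Diffusion checkpoint (illustrative choice).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Hypothetical prompt in the spirit of the study: an animal figure that may
    # trigger caption-like symbol strings in the generated image.
    prompt = "a museum plate of a bird with its species name written below"

    # Generate and save the image for later comparative analysis.
    image = pipe(prompt).images[0]
    image.save("generated_sample.png")

Varying the prompt and comparing the symbol strings that appear across generations would then provide the material for the comparative analysis the abstract describes.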