Doklady Rossijskoj Akademii Nauk. Matematika, Informatika, Protsessy Upravleniya

Dokl. RAN. Math. Inf. Proc. Upr., 2025 Volume 527, Pages 449–458 (Mi danma700)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

HAMSA: hijacking aligned compact models via stealthy automation

A. S. Krylov (a,b,c), I. I. Vagizov (a,b,c), D. S. Korzh (d,c), M. Douiba (e), A. Guezzaz (e), V. N. Kokh (a,b,c), S. D. Erokhin (d,e,f), E. V. Tutubalina (b,c,f), O. Y. Rogov (d,a,c)

a Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Moscow Region
b Sberbank, Moscow
c Artificial Intelligence Research Institute, Moscow
d Moscow Technical University of Communications and Informatics
e Cadi Ayyad University, Marrakesh, Morocco
f Kazan (Volga Region) Federal University

Abstract: Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search in which candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration against coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural-language fluency. We evaluate our method on an English benchmark (In-The-Wild Jailbreak Prompts on LLMs) and on a newly curated Arabic benchmark derived from it and annotated by native Arabic linguists, enabling multilingual assessment.
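The population-based search with temperature-controlled variability that the abstract describes can be illustrated with a toy Python sketch. This is not the authors' implementation: the fitness function (a string-matching score standing in for a jailbreak-success metric), the mutation operator, and all hyperparameters here are illustrative assumptions; only the overall loop structure (score, keep elites, mutate with an annealed temperature) follows the abstract.

```python
import random

def evolve(population, fitness, mutate, generations=200,
           temperature=2.0, cooling=0.95, keep=4, seed=0):
    """Generic population-based search: rank candidates, keep the elites,
    and produce children via a temperature-scaled mutation operator.
    The temperature is annealed each generation, so early generations
    explore broadly while later ones make small, coherence-preserving edits."""
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:keep]
        children = [mutate(rng.choice(elites), temperature, rng)
                    for _ in range(len(population) - keep)]
        population = elites + children
        temperature *= cooling  # anneal: exploration -> refinement
    return max(population, key=fitness)

# Toy objective: match a fixed string (stand-in for an attack-success score).
TARGET = "stealthy prompt"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def fitness(s):
    # Number of characters already matching the target.
    return sum(a == b for a, b in zip(s, TARGET))

def mutate(s, temperature, rng):
    # Higher temperature -> more positions perturbed per child.
    return "".join(rng.choice(ALPHABET) if rng.random() < 0.1 * temperature else c
                   for c in s)

best = evolve(["x" * len(TARGET)] * 12, fitness, mutate)
```

Because elites are carried over unchanged, the best fitness in the population is non-decreasing across generations; the cooling schedule is what the abstract's "temperature-controlled variability" corresponds to in this sketch.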

Received: 15.08.2025
Accepted: 15.09.2025

DOI: 10.7868/S2686954325070380

© Steklov Math. Inst. of RAS, 2025