Abstract:
Modern large language models are large-scale systems whose complex internal mechanisms make response generation effectively a black box. Although aligned large language models include built-in defenses, recent studies show that they remain vulnerable to jailbreak attacks. In this study, we aim to expand existing malicious datasets obtained from such attacks so that similar vulnerabilities can be addressed in future alignment procedures. In addition, we evaluate modern large language models on our malicious dataset, and the experiments reveal weaknesses that persist in these models.
Keywords: large language models, jailbreak attacks, red-teaming datasets, trustworthy artificial intelligence.