Abstract:
Modern large language models are large-scale systems whose complex internal mechanisms make response generation effectively a black box. Although aligned large language models include built-in defenses, recent studies show that they remain vulnerable to jailbreak attacks. In this study, we aim to expand existing malicious datasets obtained from such attacks so that similar vulnerabilities can be addressed in future alignment procedures. In addition, we evaluate modern large language models on our malicious dataset, and the experiments reveal weaknesses that persist in these models.
Keywords: large language models, jailbreak attacks, red-teaming datasets, trustworthy artificial intelligence.