LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Abstract
As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. These characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary dimensions and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.
Metadata
- Comment: Accepted to CVPR 2026