1Southeast University
2University of California, Santa Cruz
3Zhejiang University of Technology
4Tsinghua University
5RealAI
*Equal contribution. †Corresponding authors. §Guolei Huang, Qinzhi Peng and Gan Xu conducted their internships at RealAI. During their internships, most of the work was completed in close collaboration with Yuxuan Lu, who served as the project lead.
Overall Workflow. (a) Generate malicious intents and retrieve the associated images along with their textual descriptions; (b) MMDS dataset generation pipeline; (c) LLaVAShield audits the safety of both user inputs and assistant responses under the specified policy dimensions in multimodal multi-turn dialogue.
Comparison of different models on the MMDS test set. LLaVAShield delivers state-of-the-art performance on content moderation for multimodal multi-turn dialogue and exposes substantial shortcomings of current advanced VLMs and content moderation tools in the multimodal multi-turn setting.
Abstract
As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. These characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary dimensions and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze the vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contributions of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.
Three Risk Characteristics in Multimodal Multi-Turn Dialogue
(1) Concealment of malicious intent
In multi-turn dialogues, attackers often begin with harmless openings and gradually escalate while deferring their true intent to evade detection. In multimodal settings, they further split the objective into dispersed textual and visual cues that, once linked across turns, substantially amplify harm and increase attack success rates.
(2) Contextual risk accumulation
In multi-turn dialogues, risk accumulates over the interaction: attackers decompose the end goal across turns and exploit the model’s reliance on early “local compliance,” widening the attack surface and steering the assistant along the existing context.
(3) Cross-modal joint risk
Multimodal multi-turn dialogues require the assistant to reason over images and text jointly; yet gaps in cross-modal safety alignment persist, making such joint risks a systemic weak point.
Example of an unsafe multimodal multi-turn dialogue. The illustration shows three risk characteristics in multimodal multi-turn dialogue. The shade of the orange text indicates the degree of concealment and harmfulness of the user's malicious intent, while the shade of the red text indicates the risk level of the AI assistant's responses.
Overview
MMDS Dataset. We introduce MMDS, the first dataset for content moderation in multimodal multi-turn dialogues. It contains 4,484 carefully annotated dialogues and adopts a comprehensive safety-risk taxonomy covering 8 primary dimensions and 60 subdimensions. During dataset construction, we also develop MMRT, an automated framework that generates unsafe multimodal multi-turn dialogues. It systematically simulates cross-turn, cross-modal coordinated attacks and efficiently explores unsafe dialogue paths in multimodal multi-turn settings.
LLaVAShield. We propose LLaVAShield, which audits the safety of both user inputs and assistant responses under the specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools, while demonstrating strong generalization and flexible policy adaptation.
Vulnerabilities of mainstream VLMs. We further analyze the vulnerabilities of mainstream VLMs to harmful inputs in multimodal multi-turn dialogues and evaluate the contributions of key components, thereby advancing understanding of safety mechanisms in this setting.
Results
Main Results. (1) Advanced VLMs and content moderation tools struggle with composite safety risks in multimodal multi-turn dialogue. Results in Table 1 show that both open- and closed-source models perform poorly when harmful cases combine concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. (2) LLaVAShield sets a new SOTA for content moderation in multimodal multi-turn dialogue. On the MMDS test set, LLaVAShield attains an F1 of 95.71% on the user side and 92.24% on the assistant side. (3) LLaVAShield's advantage holds at fine-grained policy dimensions. To further analyze model performance, we conduct a fine-grained policy-dimension analysis in Table 2. We find that the performance advantage is not due to memorizing specific risks but reflects consistent understanding across diverse scenarios. Compared with GPT-5-mini, LLaVAShield achieves higher F1 on nearly all dimensions, with large margins on O2 (+39.11% on the user side) and O4 (+40.04% on the user side), which require complex contextual and cross-modal reasoning.
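As a minimal illustration of how the per-side F1 figures above can be computed, the sketch below scores binary safe/unsafe moderation labels for one side (user or assistant) of a dialogue set. The label strings and toy data are hypothetical, not part of the MMDS release.

```python
from collections import Counter

def f1_score(gold, pred, positive="unsafe"):
    """Binary F1 for one moderation side (user or assistant).

    gold, pred: lists of "safe"/"unsafe" labels, one per dialogue.
    """
    counts = Counter(zip(gold, pred))
    tp = counts[(positive, positive)]
    fp = sum(v for (g, p), v in counts.items() if p == positive and g != positive)
    fn = sum(v for (g, p), v in counts.items() if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example with hypothetical labels: one missed unsafe dialogue.
gold = ["unsafe", "safe", "unsafe", "safe"]
pred = ["unsafe", "safe", "safe", "safe"]
print(round(f1_score(gold, pred), 4))  # → 0.6667
```

Computing F1 separately per side, as above, is what allows the user-side and assistant-side scores to differ.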
Table 1: MMDS test set results. The best and second-best results are shown in bold and underlined, respectively.
Table 2: Comparison between LLaVAShield and GPT-5-mini across policy dimensions on the MMDS test set. F1 (%).
Performance on External Safety Benchmarks. We assess LLaVAShield on MM-SafetyBench and VLGuard-Test, comparing against advanced VLMs and content moderation tools. The evaluation focuses on recall of unsafe adversarial cases on MM-SafetyBench and discrimination between safe and unsafe inputs on VLGuard-Test. As shown in Table 3, LLaVAShield consistently leads on both benchmarks and maintains advantages across most metrics.
Table 3: Results on external safety benchmarks. MM-SafetyBench is reported using Recall. All numbers are in %.
Performance Under Flexible Policy Adaptation. We evaluate LLaVAShield under changes to the policy configuration using 50 dialogues from the MMDS test set that both GPT-5-mini and LLaVAShield originally labeled Unsafe on both the user and assistant sides. We report the false positive rate (FPR) as the primary metric; a lower FPR indicates more accurate acceptance of compliant content and less excessive moderation. Results show that LLaVAShield achieves a 0% FPR on both the user and assistant sides, whereas GPT-5-mini records 30% and 34%, respectively. Overall, this demonstrates LLaVAShield's strong ability to adapt to changing policy configurations.
VLM Vulnerabilities and Analysis of Component Contributions. We evaluate seven mainstream VLMs using the MMRT framework on 60 malicious intents, where each intent corresponds to a sub-dimension of our safety taxonomy. Results in Table 4 show a consistently high Attack Success Rate (ASR), underscoring that mainstream VLMs remain vulnerable to harmful inputs in multimodal multi-turn dialogues and confirming that the MMRT framework effectively explores attack paths that elicit unsafe responses. We further analyze how images and dialogue turns affect risk: including images increases the Average Score Gain by 0.375, and VLMs become more likely to produce harmful content as dialogues progress, as shown in Figure 1.
Table 4: ASR on the test set for different target models.
Figure 1: Analysis of components.
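As a hedged sketch of the ASR metric reported in Table 4: ASR is conventionally the fraction of red-teaming attempts that elicit an unsafe response from the target model. The outcome records below are hypothetical, not results from the paper.

```python
def attack_success_rate(results):
    """ASR = fraction of red-teaming attempts that elicited an unsafe response.

    results: list of booleans, True if the attempt produced harmful output.
    """
    if not results:
        return 0.0
    return sum(results) / len(results)

# Hypothetical outcomes for 60 malicious intents (one per taxonomy sub-dimension):
outcomes = [True] * 45 + [False] * 15
print(f"ASR: {attack_success_rate(outcomes):.1%}")  # → ASR: 75.0%
```

Reporting ASR per target model over the same 60 intents, as in Table 4, keeps the comparison across VLMs controlled.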
BibTeX
@misc{huang2025llavashield,
  title={LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models},
  author={Guolei Huang and Qinzhi Peng and Gan Xu and Yao Huang and Yuxuan Lu and Yongjun Shen},
  year={2025},
  eprint={2509.25896},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}