Attack HIGH relevance

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Yipu Dou Wang Yang
Published
January 16, 2026
Updated
March 19, 2026

Abstract

Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops. Existing jailbreak frameworks still leave a gap between adaptive multi-turn attacks and agentic runtimes: attack algorithms are usually packaged as monolithic scripts, while agent harnesses rarely expose explicit abstractions for rollback, tool simulation, or strategy switching. We present AJAR, a red-teaming framework that exposes multi-turn jailbreak algorithms as callable MCP services and lets an Auditor Agent orchestrate them inside a tool-aware runtime built on Petri. AJAR integrates three representative attacks, namely Crescendo, ActorAttack, and X-Teaming, under a shared service interface for planning, prompt generation, optimization, evaluation, and context control. On 200 HarmBench validation behaviors, AJAR improves X-Teaming from 65.0% to 76.0% attack success rate (ASR), reaches 80% cumulative success one turn earlier than the native implementation, and reproduces Crescendo more effectively than PyRIT (91.0% vs. 87.5% ASR). Behavior-level analysis shows that these gains are concentrated in hard categories and frequently depend on rollback-enabled transcript repair. We further show that tool access reshapes rather than uniformly enlarges the attack surface: ActorAttack rises from 51.0% to 56.0% ASR with tools, whereas Crescendo drops from 91.0% to 78.0% and X-Teaming from 76.0% to 55.5%, with the sharpest declines appearing in categories that rely on long semantic buildup. These results position AJAR as a practical foundation for evaluating multi-turn jailbreaks under realistic agent constraints. Code and data are available at https://github.com/douyipu/ajar.

Metadata

Comment
7 pages, 3 figures. Code and data available at https://github.com/douyipu/ajar

Pro Analysis

Full threat analysis, ATLAS technique mapping, compliance impact assessment (ISO 42001, EU AI Act), and actionable recommendations are available with a Pro subscription.

Threat Deep-Dive
ATLAS Mapping
Compliance Reports
Actionable Recommendations
Start 14-Day Free Trial