197 results
Paper 2510.09023v1

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

How should we evaluate the robustness of language model defenses

high relevance attack
Paper 2601.03600v1

ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Despite rich safety alignment strategies, large language models (LLMs) remain

high relevance attack
Paper 2602.05444v2

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever

high relevance attack
Paper 2510.22628v1

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

This paper presents a real-time modular defense system named

high relevance tool
Paper 2509.23558v1

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose

high relevance attack
Paper 2512.18755v1

MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking

optimizes them using a simulated annealing strategy guided by semantic similarity, toxicity, and jailbreak effectiveness. Extensive experiments on both closed-source and open-source models, including GPT-4, Claude

high relevance attack
Paper 2510.26096v1

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose

medium relevance defense
Paper 2509.25624v2

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

As LLMs advance into autonomous agents with tool-use capabilities

high relevance tool
Paper 2511.02356v1

An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

The widespread deployment of Large Language Models (LLMs) as public

high relevance tool
Paper 2511.19517v2

Automating Deception: Scalable Multi-Turn LLM Jailbreaks

paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark

high relevance attack
Paper 2511.13788v2

Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

jailbreak smaller ones - eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM

high relevance attack
Paper 2512.05485v2

TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

high-value industries continues to expand, the systematic assessment of their safety against jailbreak and prompt-based attacks remains insufficient. Existing safety evaluation benchmarks and frameworks are often limited

high relevance benchmark
Paper 2511.12782v1

LLM Reinforcement in Context

adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There

medium relevance attack
Paper 2511.00346v1

Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks

The rapid proliferation of Large Language Models (LLMs) has raised

high relevance attack
Paper 2511.04316v1

AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve

medium relevance tool
Paper 2601.05742v1

The Echo Chamber Multi-Turn LLM Jailbreak

The availability of Large Language Models (LLMs) has led to

high relevance attack
Paper 2512.20405v2

ChatGPT: Excellent Paper! Accept It. Editor: Imposter Found! Review Rejected

author can inject hidden prompts inside a PDF that secretly guide or "jailbreak" LLM reviewers into giving overly positive feedback and biased acceptance. On the defense side, we propose

medium relevance survey
Paper 2511.12217v1

AlignTree: Efficient Defense Against LLM Jailbreak Attacks

Large Language Models (LLMs) are vulnerable to adversarial attacks that

high relevance attack
Paper 2510.13901v2

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities

high relevance attack
Paper 2510.09471v1

Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large

medium relevance benchmark
Page 2 of 10