Paper 2603.04459v2

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

…based on both automated and human assessment) on LLM safety benchmarks, analyzing 31 benchmarks and 382 non-benchmarks across prompt injection, jailbreak, and hallucination. We find that benchmark papers show…

medium relevance benchmark
Paper 2510.11837v1

Countermind: A Multi-Layered Security Architecture for Large Language Models

…security of Large Language Model (LLM) applications is fundamentally challenged by "form-first" attacks like prompt injection and jailbreaking, where malicious instructions are embedded within user inputs. Conventional defenses, which…

medium relevance benchmark
Paper 2601.03265v1

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more…

high relevance attack
Paper 2510.10281v1

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test

…more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real…

high relevance attack
Paper 2603.19127v1

On Optimizing Multimodal Jailbreaks for Spoken Language Models

…inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have previously been shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses…

high relevance attack
Paper 2601.09625v2

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

…prompts engineered to exploit an application's LLM. We introduce a seven-stage promptware kill chain: Initial Access (prompt injection), Privilege Escalation (jailbreaking), Reconnaissance, Persistence (memory and retrieval poisoning), Command…

high relevance attack
Paper 2601.02377v1

Trust in LLM-controlled Robotics: a Survey of Security Threats, Defenses and Challenges

…landscape and corresponding defense strategies for LLM-controlled robotics. Specifically, we discuss a comprehensive taxonomy of attack vectors, covering topics such as jailbreaking, backdoor attacks, and multi-modal prompt injection…

medium relevance survey
Paper 2601.10141v1

Understanding and Preserving Safety in Fine-Tuned LLMs

…both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning…

medium relevance defense
Paper 2512.01353v3

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Existing approaches overwhelmingly operate within the prompt-optimization paradigm: whether through traditional algorithmic…

medium relevance defense
Paper 2602.16977v1

Fail-Closed Alignment for Large Language Models

…independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety…

medium relevance defense
Paper 2601.12460v1

TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

…word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions…

high relevance attack
Paper 2510.21983v1

Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

…safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data…

high relevance attack
Paper 2601.05445v1

Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

…fail to adapt to the LLM's dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self…

high relevance attack
Paper 2510.08646v2

Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

…undesirable states (false refusal or jailbreak) and low energy to desirable states (helpful response or safe reject). During inference, the EBM maps the LLM's internal activations to an energy…

medium relevance benchmark
Paper 2603.15417v1

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

…force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency…

medium relevance defense
Paper 2603.01414v1

Jailbreaking Embodied LLMs via Action-level Manipulation

…than iterative trial-and-error jailbreaking of black-box embodied LLMs, Blindfold adopts an Adversarial Proxy Planning strategy: it compromises a local surrogate LLM to perform action-level manipulations that…

high relevance attack
Paper 2602.16943v1

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

…tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system…

medium relevance tool
Paper 2510.14207v2

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn…

high relevance benchmark
Paper 2602.13234v1

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work…

medium relevance attack
Paper 2511.15304v3

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

…ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate…

high relevance attack
Page 5 of 10