MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval
Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic
Jailbreaking Large Language Models through Iterative Tool-Disguised Attacks via Reinforcement Learning
Multi-step Progressive Tool-disguised Jailbreak Attack), a novel adaptive jailbreak method that synergistically exploits vulnerabilities in current defense mechanisms. iMIST disguises malicious
ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs
models. We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon
Persona Jailbreaking in Large Language Models
Large Language Models (LLMs) are increasingly deployed in domains such
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when
Dual-Space Smoothness for Robust and Balanced LLM Unlearning
With the rapid advancement of large language models, Machine Unlearning
SecureBreak -- A dataset towards safe and secure models
reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies
MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful
Auto-Tuning Safety Guardrails for Black-Box Large Language Models
three public benchmarks covering malware generation, classic jailbreak prompts, and benign user queries. Each configuration is scored using malware and jailbreak attack success rates, benign harmful-response rate
Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services
through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in Banking, Financial Services, and Insurance (BFSI), combining a domain-specific taxonomy
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level
Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic
Black-box Optimization of LLM Outputs by Asking for Directions
We present a novel approach for attacking black-box large
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
LLM-based agents execute real-world workflows via tools and memory. These affordances also enable adversaries to use such agents to carry out complex misuse scenarios. Existing agent
Metaphor-based Jailbreaking Attacks on Text-to-Image Models
models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models
Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Large language models (LLMs) are increasingly deployed as tool-using
Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
Adversarial attacks by malicious users that threaten the safety of
A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation
Large Language Models (LLMs) can generate human-like disinformation, yet
Fewer Weights, More Problems: A Practical Attack on LLM Pruning
pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that