Paper 2512.13741v1

The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models

As Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical

high relevance attack
Paper 2512.09403v1

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system, as a prototype detector for real-time

medium relevance defense
Paper 2603.18433v1

Prompt Control-Flow Integrity: A Priority-Aware Runtime Defense Against Prompt Injection in LLM Systems

often treat prompts as flat strings and rely on ad hoc filtering or static jailbreak detection. This paper proposes Prompt Control-Flow Integrity (PCFI), a priority-aware runtime defense that

high relevance tool
Paper 2510.19169v2

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

such as harmful or explicit text generation, (2) model-manipulation attacks including prompt injection, jailbreaks, and code-interpreter abuse, and (3) data leakage involving sensitive or private information. Unlike prior

medium relevance tool
Paper 2603.21354v1

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing

medium relevance attack
Paper 2510.02609v2

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

they fail to cover certain boundary conditions, such as the combined effects of different jailbreak tools. In this work, we propose RedCodeAgent, the first automated red-teaming agent designed

high relevance benchmark
Paper 2603.24511v1

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering [rank2026posttrainbench, novikov2025alphaevolve]. We show that an autoresearch-style

high relevance attack
Paper 2602.00388v1

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Diffusion large language models (D-LLMs) offer an alternative to

medium relevance defense
Paper 2510.20129v1

SAID: Empowering Large Language Models with Self-Activating Internal Defense

Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering

medium relevance defense
Paper 2602.06630v1

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average

high relevance attack
Paper 2511.17666v1

Evaluating Adversarial Vulnerabilities in Modern Large Language Models

determined by the generation of disallowed content, with successful jailbreaks assigned a severity score. The findings indicate a disparity in jailbreak susceptibility between 2.5 Flash and GPT-4, suggesting variations

medium relevance attack
Paper 2601.03273v1

GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators

As large language models (LLMs) become deeply embedded in daily

medium relevance benchmark
Paper 2602.21236v1

@GrokSet: multi-party Human-LLM Interactions in Social Media

million tweets involving the @Grok LLM on X. Our analysis reveals a distinct functional shift: rather than serving as a general assistant, the LLM is frequently invoked as an authoritative

medium relevance benchmark
Paper 2511.19171v2

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion

safety evaluation of large language models (LLMs) has become extensive, driven by jailbreak studies that elicit unsafe responses. Such responses involve information already available to humans, such as the answer

medium relevance benchmark
Paper 2602.01600v1

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination as it assumes uniform

medium relevance benchmark
Paper 2510.02833v4

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones that compromise LLMs' safety alignment stand out due to their stable jailbreak performance. In particular

high relevance attack
Paper 2510.25941v3

RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model

medium relevance benchmark
Paper 2512.08967v1

CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing

repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce

medium relevance attack
Paper 2602.20170v1

CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that

high relevance benchmark
Paper 2510.04885v1

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against

high relevance attack