197 results
Paper 2509.22067v2

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed… [see sketch below]

medium relevance · defense
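A minimal, generic sketch of the mechanism this snippet describes (adding a vector to a model's hidden states during inference), not the paper's actual setup: the hook style, the layer path model.model.layers[15], and the scale alpha are assumptions about a LLaMA-style Hugging Face model.

    import torch

    def make_steering_hook(direction: torch.Tensor, alpha: float = 1.0):
        # Forward hook that shifts a decoder layer's hidden states by alpha * direction.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
            if isinstance(output, tuple):
                return (steered,) + tuple(output[1:])
            return steered
        return hook

    # Illustrative usage (layer index, scale, and attribute path are assumptions):
    # direction = torch.randn(model.config.hidden_size)  # stand-in for a derived steering vector
    # handle = model.model.layers[15].register_forward_hook(make_steering_hook(direction, alpha=4.0))
    # outputs = model.generate(**inputs)
    # handle.remove()

In a real steering setup the direction would be derived from contrasting activations rather than drawn at random; the hook itself is all that is needed to inject it at inference time.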
Paper 2509.20230v3

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Current LLM unlearning methods face a critical security vulnerability that…

medium relevance · benchmark
Paper 2603.15397v1

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

…have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. Existing defense mechanisms typically rely on post hoc filtering…

medium relevance · benchmark
Paper 2602.16935v1

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates…

medium relevance · attack
Paper 2602.06911v1

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

As increasingly capable open-weight large language models (LLMs) are…

medium relevance · tool
Paper 2602.04581v1

Trust The Typical

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety…

medium relevance · benchmark
Paper 2601.22636v2

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

…propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution… [see sketch below]

medium relevance · attack
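As a side note on the Beta model named in this snippet: if the per-prompt success probability p follows Beta(alpha, beta), the probability that at least one of N independent samples succeeds has the closed form 1 - B(alpha, beta + N) / B(alpha, beta). The sketch below is my own illustration of that identity, not the paper's SABER estimator; the function name and example parameters are assumptions.

    import numpy as np
    from scipy.special import betaln

    def expected_best_of_n_success(alpha: float, beta: float, n: int) -> float:
        # E_p[1 - (1 - p)^n] for p ~ Beta(alpha, beta), computed in log space via
        # E[(1 - p)^n] = B(alpha, beta + n) / B(alpha, beta).
        return 1.0 - float(np.exp(betaln(alpha, beta + n) - betaln(alpha, beta)))

    # Example: a per-sample success probability with mean 0.02 under growing N.
    for n in (1, 10, 100, 1000):
        print(n, round(expected_best_of_n_success(0.5, 24.5, n), 3))

By Jensen's inequality this mixture grows more slowly in N than the point estimate 1 - (1 - mean(p))^N, which is why fitting the full distribution rather than a single rate matters for Best-of-N risk.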
Paper 2601.17003v1

Beyond Simulations: What 20,000 Real Conversations Reveal About Mental Health AI Safety

Large language models (LLMs) are increasingly used for mental health…

medium relevance · defense
Paper 2601.03868v2

What Matters For Safety Alignment?

This paper presents a comprehensive empirical study on the safety…

medium relevance · defense
Paper 2601.02680v1

Adversarial Contrastive Learning for LLM Quantization Attacks

Model quantization is critical for deploying large language models (LLMs)…

high relevance · attack
Paper 2601.00454v1

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive…

medium relevance · defense
Paper 2511.22047v1

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation, yet their robustness against sophisticated adversarial attacks remains poorly characterized. This study…

high relevance · attack
Paper 2511.00203v1

Diffusion LLMs are Natural Adversaries for any LLM

We introduce a novel framework that transforms the resource-intensive…

medium relevance · attack
Paper 2511.04694v4

Reasoning Up the Instruction Ladder for Controllable Language Models

As large language model (LLM)-based systems take on high…

medium relevance · benchmark
Paper 2510.18541v1

Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

…transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates…

medium relevance · benchmark
Paper 2510.03520v1

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

…provable safety guarantee for a fixed dual variable that can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF) that introduces a cost…

medium relevance · benchmark
Paper 2510.01088v1

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that…

medium relevance · defense
Page 10 of 10