Hidden State Poisoning Attacks against Mamba-based Language Models
their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms
SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models
Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior
Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data
data poisoning and backdoor attacks show that VIA significantly increases the presence of poisoning content in synthetic data and correspondingly raises the attack success rate (ASR) on downstream models
Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models
environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot
Hidden State Poisoning Attacks against Mamba-based Language Models
their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms
The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models
conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets
Confundo: Learning to Generate Robust Poison for Practical RAG Systems
present Confundo, a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness. Confundo provides a unified framework
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension
Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models
induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to recover
Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging
Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models
large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness
Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems
making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection
Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs
poisoning attack targeting the reasoning process of medical LLMs during SFT. Unlike backdoor attacks, our method injects poisoned rationales into few-shot training data, leading to stealthy degradation of model
From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such
AntiFLipper: A Secure and Efficient Defense Against Label-Flipping Attacks in Federated Learning
remains vulnerable to label-flipping attacks, where malicious clients manipulate labels to poison the global model. Despite their simplicity, these attacks can severely degrade model performance, and defending against them
Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice
perspective. We systematize three major sources of mismatch between research and practice: unrealistic poisoning threat models, the omission of hybrid heterogeneity, and incomplete metrics that overemphasize peak attack success while
AutoBackdoor: Automating Backdoor Attacks via LLM Agents
model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning
Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs
model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings
Rounding-Guided Backdoor Injection in Deep Learning Model Quantization
quantization to embed malicious behaviors. Unlike conventional backdoor attacks relying on training data poisoning or model training manipulation, QuRA solely works using the quantization operations. In particular, QuRA first employs
The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers
Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper agent-style backdoors in causal