P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning
Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models
prompts and parameters are secured, transformer language models remain vulnerable because their key-value (KV) cache during inference constitutes an overlooked attack surface. This paper introduces Malicious Token Injection
SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models
reference pool by sampling a frozen text-to-image model conditioned on the target prompt, and then carefully select the Top-K most semantically relevant anchors under the surrogate
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective
Deep Research Brings Deeper Harm
agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan; and Intent Hijack, which reframes harmful queries
Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly
RAG Security and Privacy: Formalizing the Threat Model and Attack Surface
demonstrated that LLMs can leak sensitive information through training data memorization or adversarial prompts, and RAG systems inherit many of these vulnerabilities. At the same time, reliance
Toward Honest Language Models for Deductive Reasoning
cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle
Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments
system, yet real deployments remain vulnerable to microarchitectural leakage, side-channel attacks, and fault injection. In parallel, security teams increasingly rely on Large Language Model (LLM) assistants as security advisors
ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation
Multimodal large language models (MLLMs) are increasingly used to automate
Memory Poisoning Attack and Defense on Memory Based LLM-Agents
memory and influence future responses. Recent work demonstrated that the MINJA (Memory Injection Attack) achieves over 95 % injection success rate and 70 % attack success rate under idealized conditions. However
Open WebUI Affected by an External Model Server (Direct Connections
TFL: Targeted Bit-Flip Attack on Large Language Model
safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit computer main memory
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning
OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference
signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional
Whisper Leak: a side-channel attack on Large Language Models
paramount. This paper introduces Whisper Leak, a side-channel attack that infers user prompt topics from encrypted LLM traffic by analyzing packet size and timing patterns in streaming responses. Despite
Has the Two-Decade-Old Prophecy Come True? Artificial Bad Intelligence Triggered by Merely a Single-Bit Flip in Large Language Models
Recently, Bit-Flip Attack (BFA) has garnered widespread attention for