Recent advances in Large Language Models (LLMs) have brought remarkable progress in code understanding and reasoning, creating new opportunities and...
Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou +4 more
AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a...
In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying...
Large language models (LLMs) are now routinely used to autonomously execute complex tasks, from natural language processing to dynamic workflows like...
Milad Nasr, Nicholas Carlini, Chawin Sitawarin +11 more
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an...
The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on...
Large Language Models (LLMs) have been augmented with web search to overcome the limitations of the static knowledge boundary by accessing up-to-date...
Christos Ziakas, Nicholas Loo, Nishita Jain +1 more
Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack...
This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework,...
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs....