Benchmark LOW
Chengwei Wei, Jung-jae Kim, Longyin Zhang +2 more
Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary...
1 week ago cs.AI cs.CL
Benchmark LOW
Min Zeng, Shuang Zhou, Zaifu Zhan +1 more
Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although...
Benchmark MEDIUM
Caglar Yildirim
Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task...
Benchmark MEDIUM
Gengxin Sun, Ruihao Yu, Liangyi Yin +3 more
Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose...
1 week ago cs.MA cs.AI
Benchmark LOW
Lingyu Li, Yan Teng, Yingchun Wang
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal...
1 week ago cs.CL cs.AI
Benchmark LOW
Trishita Dhara, Siddhesh Sheth
Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this,...
Benchmark LOW
Simone Aonzo, Merve Sahin, Aurélien Francillon +1 more
Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over...
1 week ago cs.CR cs.AI
Benchmark LOW
Taeyun Roh, Wonjune Jang, Junha Jung +1 more
Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store...
1 week ago cs.CL cs.AI
Benchmark MEDIUM
Yu Pan, Wenlong Yu, Tiejun Wu +4 more
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to...
1 week ago cs.CR cs.AI
Benchmark MEDIUM
Ye Wang, Jing Liu, Toshiaki Koike-Akino
The safety and reliability of vision-language models (VLMs) are crucial for deploying trustworthy agentic AI systems. However, VLMs remain...
1 week ago cs.LG cs.AI cs.CL
Benchmark MEDIUM
Yuhuan Liu, Haitian Zhong, Xinyuan Xia +3 more
Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems...
Benchmark MEDIUM
Jinhu Qi, Yifan Li, Minghao Zhao +4 more
As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased...
1 week ago cs.CL cs.DB
Benchmark HIGH
Lidor Erez, Omer Hofman, Tamir Nizri +1 more
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the...
1 week ago cs.CR cs.PF
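The attack success rate (ASR) metric this benchmark measures is conventionally the fraction of attack attempts that succeed, reported per attack type. A minimal sketch of that computation (the record shape and attack-type names here are illustrative assumptions, not taken from the scanner itself):

```python
from collections import defaultdict

def attack_success_rates(results):
    """Compute per-attack-type success rates from (attack_type, succeeded) records."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for attack_type, succeeded in results:
        attempts[attack_type] += 1
        if succeeded:
            successes[attack_type] += 1
    # ASR per type = successful attempts / total attempts of that type
    return {t: successes[t] / attempts[t] for t in attempts}

results = [
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("jailbreak", True),
    ("jailbreak", True),
]
print(attack_success_rates(results))  # {'prompt_injection': 0.5, 'jailbreak': 1.0}
```

The paper's point is that such per-type rates can diverge across scanners even on identical targets, which is why aggregate ASR numbers need careful interpretation.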
Benchmark LOW
Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura +3 more
Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs...
1 week ago cs.CV cs.LG
Benchmark LOW
Xiaoya Lu, Yijin Zhou, Zeren Chen +6 more
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where...
Benchmark MEDIUM
Ivan Lopez, Selin S. Everett, Bryan J. Bunning +10 more
Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during...
1 week ago cs.HC cs.LG
Benchmark MEDIUM
Arjun Chakraborty, Sandra Ho, Adam Cook +1 more
CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is a benchmark designed to evaluate AI agents' ability to interpret cyber threat...
Benchmark LOW
Ziyu Liu, Shengyuan Ding, Xinyu Fang +7 more
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured...
1 week ago cs.CV cs.AI
Benchmark MEDIUM
Zhifang Zhang, Bojun Yang, Shuo He +5 more
Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where...
1 week ago cs.CV cs.CR