Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

Author: Zichen Chen, Jiaao Chen, Jianda Chen, Misha Sra

Type: paper
Date: 2025-02-21
Language: English

Tags: quantitative finance, arXiv

Description

Standard benchmarks fixate on how well large language model (LLM) agents perform in finance, yet say little about whether they are safe to deploy. We argue that accuracy metrics and return-based scores provide an illusion of reliability, overlooking vulnerabilities such as hallucinated facts, stale data, and adversarial prompt manipulation. We take a firm position: financial LLM agents should be evaluated first and foremost on their risk profile, not on their point-estimate performance. Drawing ...