- [2026.04] 🎉 This paper has been accepted to Findings of ACL 2026!
FinSafetyBench is a bilingual (English / Chinese) red-teaming benchmark for financial safety. It is designed to evaluate LLM refusal and defense behavior on realistic financial crime and ethics violation prompts. The dataset combines self-constructed cases and curated external samples for research on jailbreak testing, cross-lingual robustness, and defensive methods.
The dataset encompasses 14 fine-grained subcategories across financial crimes and professional-ethics violations.
FinSafetyBench is built through a multi-stage pipeline: real-world case collection, data filtering, harmful instruction generation, external expansion, and bilingual alignment.
Under direct, non-jailbroken queries, general-purpose LLMs maintain near-zero baseline ASRs (the finance-specific model XuanYuan is the exception). The same models are significantly vulnerable to three representative jailbreak attacks: PAIR, ReNeLLM, and FlipAttack. The attack implementations closely follow their official source code, and the jailbreak prompts are natively adapted to Chinese when evaluating on the Chinese dataset. The table below shows the average Attack Success Rate (ASR%).
| Target Model | Attack Method | Financial Crimes (En) | Financial Crimes (Zh) | Ethical Violations (En) | Ethical Violations (Zh) |
|---|---|---|---|---|---|
| LLaMA-3 (Meta-Llama-3-8B-Instruct) | PAIR | 34.79 | 78.58 | 45.34 | 61.69 |
| | ReNeLLM | 32.91 | 39.72 | 48.35 | 33.97 |
| | FlipAttack | 29.16 | 39.65 | 27.99 | 32.05 |
| | Average | 32.29 | 52.65 | 40.56 | 42.57 |
| InternLM3 (InternLM3-8B-Instruct) | PAIR | 90.74 | 80.73 | 72.83 | 63.90 |
| | ReNeLLM | 84.94 | 77.47 | 57.16 | 44.23 |
| | FlipAttack | 40.88 | 36.67 | 18.17 | 32.63 |
| | Average | 72.18 | 64.96 | 49.39 | 46.92 |
| GLM-4 (GLM-4-9B-0414) | PAIR | 90.52 | 90.11 | 71.31 | 75.78 |
| | ReNeLLM | 92.00 | 93.77 | 79.40 | 73.21 |
| | FlipAttack | 34.51 | 77.19 | 20.02 | 35.76 |
| | Average | 72.34 | 87.02 | 56.91 | 61.59 |
| Mistral (Mistral-Small-24B-Instruct-2501) | PAIR | 94.53 | 93.87 | 81.71 | 77.37 |
| | ReNeLLM | 92.81 | 88.34 | 86.15 | 70.64 |
| | FlipAttack | 93.32 | 92.15 | 68.97 | 64.60 |
| | Average | 93.55 | 91.46 | 78.95 | 70.87 |
| Qwen2.5 (Qwen2.5-32B-Instruct) | PAIR | 92.12 | 81.65 | 72.03 | 67.85 |
| | ReNeLLM | 79.36 | 68.21 | 72.66 | 60.16 |
| | FlipAttack | 93.06 | 92.33 | 71.34 | 68.58 |
| | Average | 88.18 | 80.73 | 72.01 | 65.53 |
| XuanYuan (XuanYuan-13B-Chat) | PAIR | 89.76 | 92.12 | 72.34 | 75.14 |
| | ReNeLLM | 76.14 | 71.21 | 50.70 | 36.66 |
| | FlipAttack | 6.52 | 45.39 | 12.15 | 33.75 |
| | Average | 57.47 | 69.58 | 45.06 | 48.52 |
(Higher ASR indicates the model is more vulnerable to the corresponding attack)
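
The "Average" rows appear to be the unweighted mean of the three per-attack ASR values, which can be checked against the reported numbers. The sketch below does this for two LLaMA-3 columns; the dictionary layout and the helper name `average_asr` are illustrative assumptions, not part of the repository.

```python
from statistics import mean

# ASR (%) values copied from the LLaMA-3 rows of the results table.
llama3_asr = {
    "Financial Crimes (En)": {"PAIR": 34.79, "ReNeLLM": 32.91, "FlipAttack": 29.16},
    "Financial Crimes (Zh)": {"PAIR": 78.58, "ReNeLLM": 39.72, "FlipAttack": 39.65},
}


def average_asr(per_attack):
    """Unweighted mean of per-attack ASR values, rounded to two decimals."""
    return round(mean(per_attack.values()), 2)


for split, per_attack in llama3_asr.items():
    print(f"{split}: {average_asr(per_attack)}")
```

Both results match the reported Average cells (32.29 and 52.65).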
- Path: `data/FinsafetyBenchmark.csv`
- CSV columns: `id`, `question_en`, `category_en`, `question_zh`, `category_zh`, `source`
- Total samples: 1881 (1201 self-constructed + 680 external)
- Languages: Chinese ⇄ English aligned pairs
- Categories: 14 fine-grained subcategories across financial crime and professional-ethics violations
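
A minimal sketch of loading the CSV with the Python standard library and counting samples per column value. The helper names `load_benchmark` and `count_by` are hypothetical, and the `utf-8-sig` encoding is an assumption (it reads UTF-8 files with or without a byte-order mark); only the column names come from the description above.

```python
import csv
from collections import Counter

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "id", "question_en", "category_en",
    "question_zh", "category_zh", "source",
]


def load_benchmark(path):
    """Read the benchmark CSV into a list of dicts, checking the header."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return list(reader)


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)
```

For example, `count_by(load_benchmark("data/FinsafetyBenchmark.csv"), "category_en")` would give the number of samples in each English category, and the same call with `"source"` would give the split by the values stored in that column.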
The `judge.py` script automatically evaluates model responses on the benchmark. It combines keyword dictionary detection with a scoring mechanism based on a Judge LLM (e.g., Qwen3-32B) to produce a quantitative score (1 to 10) assessing whether the response contains illegal information or exhibits jailbreak behavior.
Usage:
1. Configure Environment: Ensure your Judge LLM service is running and the necessary environment variables (such as `DEEPINFRA_API_KEY`) are set.
2. Prepare Input: Prepare a JSONL file where each line contains the `question` (prompt) and the model `output` (response) fields to be evaluated.
3. Run Evaluation: Run `python judge.py` to start the evaluation process.

Note: You may need to modify the configuration variables in the `if __name__ == '__main__':` block at the end of the `judge.py` script to match your actual input/output filenames and Judge LLM model ID.
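
The input format from the "Prepare Input" step can be sketched as follows. Only the `question` and `output` field names come from the description above; the helper names `write_judge_input` and `read_judge_input` and any filename you pass to them are hypothetical, and `judge.py` may expect additional details not shown here.

```python
import json


def write_judge_input(records, path):
    """Write (question, output) pairs as JSONL, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, output in records:
            row = {"question": question, "output": output}
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_judge_input(path):
    """Read the JSONL file back and check that both fields are present."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            if "question" not in row or "output" not in row:
                raise ValueError("each line needs 'question' and 'output'")
            rows.append(row)
    return rows
```

Reading the file back before running the judge is a cheap way to catch malformed lines early.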
If you use FinSafetyBench or find our work useful, please cite our paper:
@misc{hou2026finsafetybenchevaluatingllmsafety,
title={FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios},
author={Yutao Hou and Yihan Jiang and Yuhan Xie and Jian Yang and Liwen Zhang and Hailiang Huang and Guanhua Chen and Yun Chen},
year={2026},
eprint={2605.00706},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.00706},
}
