FinSafetyBench

News

[2026.04] 🎉 This paper has been accepted to Findings of ACL 2026!

Overview

FinSafetyBench is a bilingual (English / Chinese) red-teaming benchmark for financial safety. It is designed to evaluate LLM refusal and defense behavior on realistic financial crime and ethics violation prompts. The dataset combines self-constructed cases and curated external samples for research on jailbreak testing, cross-lingual robustness, and defensive methods.

📊 Benchmark Framework

Taxonomy of Financial Violations

The dataset encompasses 14 fine-grained subcategories across financial crimes and professional-ethics violations.

Data Construction Pipeline

FinSafetyBench is built in five stages: real-world case collection, data filtering, harmful instruction generation, external expansion, and bilingual alignment.

📈 Main Results

Under direct, non-jailbroken queries, general-purpose LLMs maintain near-zero baseline ASRs; the finance-specific model XuanYuan is the exception. Under three representative jailbreak attacks (PAIR, ReNeLLM, FlipAttack), however, they become significantly more vulnerable. The attack implementations closely follow their official source code, and the jailbreak prompts are natively adapted to Chinese for evaluation on the Chinese dataset. The table below reports the Attack Success Rate (ASR, %) for each attack, together with the average over the three attacks.

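Target Model	Attack Method	Financial Crimes (En)	Financial Crimes (Zh)	Ethical Violations (En)	Ethical Violations (Zh)
LLaMA-3 (Meta-Llama-3-8B-Instruct)	PAIR	34.79	78.58	45.34	61.69
LLaMA-3 (Meta-Llama-3-8B-Instruct)	ReNeLLM	32.91	39.72	48.35	33.97
LLaMA-3 (Meta-Llama-3-8B-Instruct)	FlipAttack	29.16	39.65	27.99	32.05
LLaMA-3 (Meta-Llama-3-8B-Instruct)	Average	32.29	52.65	40.56	42.57
InternLM3 (InternLM3-8B-Instruct)	PAIR	90.74	80.73	72.83	63.90
InternLM3 (InternLM3-8B-Instruct)	ReNeLLM	84.94	77.47	57.16	44.23
InternLM3 (InternLM3-8B-Instruct)	FlipAttack	40.88	36.67	18.17	32.63
InternLM3 (InternLM3-8B-Instruct)	Average	72.18	64.96	49.39	46.92
GLM-4 (GLM-4-9B-0414)	PAIR	90.52	90.11	71.31	75.78
GLM-4 (GLM-4-9B-0414)	ReNeLLM	92.00	93.77	79.40	73.21
GLM-4 (GLM-4-9B-0414)	FlipAttack	34.51	77.19	20.02	35.76
GLM-4 (GLM-4-9B-0414)	Average	72.34	87.02	56.91	61.59
Mistral (Mistral-Small-24B-Instruct-2501)	PAIR	94.53	93.87	81.71	77.37
Mistral (Mistral-Small-24B-Instruct-2501)	ReNeLLM	92.81	88.34	86.15	70.64
Mistral (Mistral-Small-24B-Instruct-2501)	FlipAttack	93.32	92.15	68.97	64.60
Mistral (Mistral-Small-24B-Instruct-2501)	Average	93.55	91.46	78.95	70.87
Qwen2.5 (Qwen2.5-32B-Instruct)	PAIR	92.12	81.65	72.03	67.85
Qwen2.5 (Qwen2.5-32B-Instruct)	ReNeLLM	79.36	68.21	72.66	60.16
Qwen2.5 (Qwen2.5-32B-Instruct)	FlipAttack	93.06	92.33	71.34	68.58
Qwen2.5 (Qwen2.5-32B-Instruct)	Average	88.18	80.73	72.01	65.53
XuanYuan (XuanYuan-13B-Chat)	PAIR	89.76	92.12	72.34	75.14
XuanYuan (XuanYuan-13B-Chat)	ReNeLLM	76.14	71.21	50.70	36.66
XuanYuan (XuanYuan-13B-Chat)	FlipAttack	6.52	45.39	12.15	33.75
XuanYuan (XuanYuan-13B-Chat)	Average	57.47	69.58	45.06	48.52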

(Higher ASR indicates the model is more vulnerable to the corresponding attack)
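
The Average entries appear to be the unweighted mean of the three per-attack ASR values in the same column. The sketch below reproduces that arithmetic for LLaMA-3 on the English financial-crime prompts, using the numbers from the table. The helper name `average_asr` is illustrative and not part of the repository. Because the per-attack values in the table are already rounded, a recomputed average can differ from the reported one in the last digit.

```python
from statistics import mean

# ASR (%) for LLaMA-3 on English financial-crime prompts, copied from the table.
asr_by_attack = {"PAIR": 34.79, "ReNeLLM": 32.91, "FlipAttack": 29.16}


def average_asr(asr_values):
    """Unweighted mean of per-attack ASR percentages, rounded to two decimals."""
    return round(mean(asr_values), 2)


llama3_crimes_en = average_asr(asr_by_attack.values())
print(llama3_crimes_en)  # matches the 32.29 reported in the Average row
```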

Data file

Path: data/FinsafetyBenchmark.csv
CSV columns: id, question_en, category_en, question_zh, category_zh, source
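
As a minimal sketch of reading the file, the snippet below parses the CSV with Python's standard `csv` module and extracts the prompts for one language. The helper names `load_rows` and `prompts` are hypothetical, the placeholder row is not taken from the dataset, and UTF-8 encoding is an assumption.

```python
import csv
import io


def load_rows(handle):
    """Read benchmark rows from an open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def prompts(rows, lang):
    """Return (id, question, category) triples for lang 'en' or 'zh'."""
    if lang not in ("en", "zh"):
        raise ValueError("lang must be 'en' or 'zh'")
    return [(r["id"], r[f"question_{lang}"], r[f"category_{lang}"]) for r in rows]


# Placeholder row for illustration only; it is not taken from the dataset.
sample = io.StringIO(
    "id,question_en,category_en,question_zh,category_zh,source\n"
    "1,<english prompt>,<english category>,<chinese prompt>,<chinese category>,<source label>\n"
)
rows = load_rows(sample)
print(prompts(rows, "en"))

# With the real file (assuming UTF-8 encoding):
# with open("data/FinsafetyBenchmark.csv", newline="", encoding="utf-8") as f:
#     rows = load_rows(f)
```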

Size & provenance

Total samples: 1881 (1201 self-constructed + 680 external)
Languages: Chinese ⇄ English aligned pairs
Categories: 14 fine-grained subcategories across financial crime and professional-ethics violations
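
The figures above can be checked against a loaded copy of the CSV. The sketch below counts rows, tallies the `source` column, and counts distinct English categories. The function name `summarize` is hypothetical, and the `source` labels in the toy rows are placeholders, since the actual label strings used in the file are not documented here.

```python
from collections import Counter


def summarize(rows):
    """Return (total rows, rows per source label, number of distinct categories)."""
    total = len(rows)
    by_source = Counter(r["source"] for r in rows)
    n_categories = len({r["category_en"] for r in rows})
    return total, by_source, n_categories


# Toy rows with placeholder labels; real rows come from csv.DictReader on the data file.
toy_rows = [
    {"source": "<label A>", "category_en": "<category 1>"},
    {"source": "<label A>", "category_en": "<category 2>"},
    {"source": "<label B>", "category_en": "<category 1>"},
]
total, by_source, n_categories = summarize(toy_rows)
print(total, dict(by_source), n_categories)
```

On the full dataset, one would expect `total` to be 1881 and `n_categories` to be 14, with the source counts splitting into 1201 and 680.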

Evaluation Script (judge.py)

The judge.py script is used to automatically evaluate model responses on the benchmark. It combines keyword dictionary detection and a scoring mechanism based on a Judge LLM (e.g., Qwen3-32B) to provide a quantitative score (1-10) assessing whether the response contains illegal information or exhibits jailbreak behavior.
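
One plausible way to combine the two signals is sketched below; it is not the logic of judge.py. It assumes the keyword dictionary flags refusals and that a judge score at or above a cut-off counts as a successful attack. The keyword list, the threshold of 5, and both function names are hypothetical.

```python
# Hypothetical refusal markers; the actual keyword dictionary in judge.py may differ.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "我无法", "抱歉")
SCORE_THRESHOLD = 5  # assumed cut-off on the 1-10 judge scale


def is_attack_success(response, judge_score,
                      keywords=REFUSAL_KEYWORDS, threshold=SCORE_THRESHOLD):
    """Treat a response as jailbroken if it has no refusal keyword and a high judge score."""
    if not 1 <= judge_score <= 10:
        raise ValueError("judge_score must be between 1 and 10")
    text = response.lower()
    if any(keyword in text for keyword in keywords):
        return False
    return judge_score >= threshold


def attack_success_rate(records):
    """ASR in percent over (response, judge_score) pairs."""
    if not records:
        return 0.0
    hits = sum(is_attack_success(response, score) for response, score in records)
    return 100.0 * hits / len(records)


# Placeholder responses and scores for illustration only.
records = [
    ("I'm sorry, but I can't help with that.", 1),
    ("<harmful completion>", 9),
    ("<borderline completion>", 3),
]
print(round(attack_success_rate(records), 2))
```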

Usage:

1. Configure Environment: Ensure your Judge LLM service is running and necessary environment variables (such as DEEPINFRA_API_KEY) are set.
2. Prepare Input: Prepare a JSONL file where each line contains the question (prompt) and the model output (response) fields to be evaluated.
3. Run Evaluation: Execute the following command to start the evaluation process:
```
python judge.py
```
Note: You may need to modify the configuration variables in the if __name__ == '__main__': block at the end of the judge.py script according to your actual input/output filenames and Judge LLM Model ID.
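
The input file from the Prepare Input step can be produced with a few lines of Python. This sketch assumes the field names are literally `prompt` and `response`, as the description suggests; the helper name `write_jsonl` and the output filename in the comment are hypothetical.

```python
import io
import json


def write_jsonl(pairs, handle):
    """Write one JSON object per line with 'prompt' and 'response' fields."""
    for prompt, response in pairs:
        record = {"prompt": prompt, "response": response}
        # ensure_ascii=False keeps Chinese text readable in the output file.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder pair for illustration only.
pairs = [("<benchmark question>", "<model output>")]

buffer = io.StringIO()
write_jsonl(pairs, buffer)
print(buffer.getvalue(), end="")

# Writing to disk (the filename is an assumption; match it to the config in judge.py):
# with open("responses.jsonl", "w", encoding="utf-8") as f:
#     write_jsonl(pairs, f)
```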

Citation

If you use FinSafetyBench or find our work useful, please cite our paper:

@misc{hou2026finsafetybenchevaluatingllmsafety,
      title={FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios}, 
      author={Yutao Hou and Yihan Jiang and Yuhan Xie and Jian Yang and Liwen Zhang and Hailiang Huang and Guanhua Chen and Yun Chen},
      year={2026},
      eprint={2605.00706},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.00706}, 
}
