Skip to content

sustech-nlp/FinSafetyBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 

Repository files navigation

FinSafetyBench

News

  • [2026.04] 🎉 This paper has been accepted to the ACL 2026 Findings!

Overview

FinSafetyBench is a bilingual (English / Chinese) red-teaming benchmark for financial safety. It is designed to evaluate LLM refusal and defense behavior on realistic financial crime and ethics violation prompts. The dataset combines self-constructed cases and curated external samples for research on jailbreak testing, cross-lingual robustness, and defensive methods.

📊 Benchmark Framework

Taxonomy of Financial Violations

The dataset encompasses 14 fine-grained subcategories across financial crimes and professional-ethics violations.

Taxonomy of FinSafetyBench

Data Construction Pipeline

FinSafetyBench is built upon real-world case collections, data filtering, harmful instruction generation, external expansion, and bilingual alignment.

Data Construction Pipeline

📈 Main Results

While general-purpose LLMs maintain near-zero baseline ASRs under direct non-jailbroken queries (with the exception of the finance-specific model XuanYuan), they demonstrate significant vulnerability when evaluated against three representative jailbreak attacks (PAIR, ReNeLLM, FlipAttack). The implementations of these attacks closely follow their official source codes, with the jailbreak prompts natively adapted to Chinese when evaluated on the Chinese dataset. The table below shows the average Attack Success Rate (ASR%).

Target Model Attack Method Financial Crimes (En) Financial Crimes (Zh) Ethical Violations (En) Ethical Violations (Zh)
LLaMA-3
(Meta-Llama-3-8B-Instruct)
PAIR
ReNeLLM
FlipAttack
Average
34.79
32.91
29.16
32.29
78.58
39.72
39.65
52.65
45.34
48.35
27.99
40.56
61.69
33.97
32.05
42.57
InternLM3
(InternLM3-8B-Instruct)
PAIR
ReNeLLM
FlipAttack
Average
90.74
84.94
40.88
72.18
80.73
77.47
36.67
64.96
72.83
57.16
18.17
49.39
63.90
44.23
32.63
46.92
GLM-4
(GLM-4-9B-0414)
PAIR
ReNeLLM
FlipAttack
Average
90.52
92.00
34.51
72.34
90.11
93.77
77.19
87.02
71.31
79.40
20.02
56.91
75.78
73.21
35.76
61.59
Mistral
(Mistral-Small-24B-Instruct-2501)
PAIR
ReNeLLM
FlipAttack
Average
94.53
92.81
93.32
93.55
93.87
88.34
92.15
91.46
81.71
86.15
68.97
78.95
77.37
70.64
64.60
70.87
Qwen2.5
(Qwen2.5-32B-Instruct)
PAIR
ReNeLLM
FlipAttack
Average
92.12
79.36
93.06
88.18
81.65
68.21
92.33
80.73
72.03
72.66
71.34
72.01
67.85
60.16
68.58
65.53
XuanYuan
(XuanYuan-13B-Chat)
PAIR
ReNeLLM
FlipAttack
Average
89.76
76.14
6.52
57.47
92.12
71.21
45.39
69.58
72.34
50.70
12.15
45.06
75.14
36.66
33.75
48.52

(Higher ASR indicates the model is more vulnerable to the corresponding attack)

Data file

Path: data/FinsafetyBenchmark.csv
CSV columns: id, question_en, category_en, question_zh, category_zh, source

Size & provenance

  • Total samples: 1881 (1201 self-constructed + 680 external)
  • Languages: Chinese ⇄ English aligned pairs
  • Categories: 14 fine-grained subcategories across financial crime and professional-ethics violations

Evaluation Script (judge.py)

The judge.py script is used to automatically evaluate model responses on the benchmark. It combines keyword dictionary detection and a scoring mechanism based on a Judge LLM (e.g., Qwen3-32B) to provide a quantitative score (1-10) assessing whether the response contains illegal information or exhibits jailbreak behavior.

Usage:

  1. Configure Environment: Ensure your Judge LLM service is running and necessary environment variables (such as DEEPINFRA_API_KEY) are set.

  2. Prepare Input: Prepare a JSONL file where each line contains the question (prompt) and the model output (response) fields to be evaluated.

  3. Run Evaluation: Execute the following Python command to start the evaluation process:

    python judge.py

    Note: You may need to modify the configuration variables in the if __name__ == '__main__': block at the end of the judge.py script according to your actual input/output filenames and Judge LLM Model ID.


Citation

If you use the FinSafetyBench or find our work useful, please cite our paper:

@misc{hou2026finsafetybenchevaluatingllmsafety,
      title={FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios}, 
      author={Yutao Hou and Yihan Jiang and Yuhan Xie and Jian Yang and Liwen Zhang and Hailiang Huang and Guanhua Chen and Yun Chen},
      year={2026},
      eprint={2605.00706},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.00706}, 
}

About

[Findings of ACL2026] FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages