Ed-Zh/PARDEN

This repository contains the data and a minimal code demo for our paper, accepted at ICML 2024: PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Paper: https://arxiv.org/abs/2405.07932

Blogpost:

PARDEN_data/ contains the benign and harmful datasets generated by different models. The harmful datasets were generated using both GCG [1] and prompt injection; the paper explains the procedure in detail.

PARDEN_notebook_minimal.ipynb demonstrates how to use PARDEN and tests its performance on the harmful strings generated by llama2-7b.
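For readers who want the gist before opening the notebook, here is a rough sketch of the repetition idea named in the paper title: ask the model to repeat its own output, then flag the output if the repeat diverges too much from it. This is not the notebook's code. The names `ngram_precision`, `repetition_filter`, and `toy_repeat` are hypothetical, the bigram overlap score is a simple stand-in for whatever similarity metric the paper uses, the threshold of 0.5 is an arbitrary assumption, and `toy_repeat` is a stub in place of a real LLM call.

```python
from collections import Counter
from typing import Callable, Tuple


def ngram_precision(reference: str, candidate: str, n: int = 2) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference.

    A simple stand-in similarity score (assumption), not the paper's metric.
    """
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each n-gram count by how often it appears in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


def repetition_filter(
    output: str,
    repeat_fn: Callable[[str], str],
    threshold: float = 0.5,  # arbitrary value chosen for illustration
) -> Tuple[bool, float]:
    """Return (flagged, score): flagged is True when the repeat diverges from the output."""
    repeated = repeat_fn(output)
    score = ngram_precision(output, repeated)
    return score < threshold, score


def toy_repeat(text: str) -> str:
    """Hypothetical stub for an LLM asked to repeat `text`.

    It refuses on a keyword and otherwise echoes the text verbatim.
    """
    if "bomb" in text.lower():
        return "I cannot repeat that."
    return text


if __name__ == "__main__":
    for sample in ["The weather is nice today", "Step one: build the bomb casing"]:
        flagged, score = repetition_filter(sample, toy_repeat)
        print(f"{sample!r}: flagged={flagged}, score={score:.2f}")
```

To try this with a real model, replace `toy_repeat` with a function that prompts the model to repeat the given text and returns its completion. The notebook and paper are the reference for the actual prompt, metric, and threshold.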

[1] Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. https://arxiv.org/abs/2307.15043
