Avoiding jailbreaks by discouraging their representation in activation space

Oct 17, 2024

This project was submitted by Guido Ernesto Bergman. It was one of the top submissions in our AI Alignment course (Jun 2024). Participants worked on these projects for 4 weeks. The text below is an excerpt from the final project.

Abstract

This project asks two questions: “Can jailbreaks be represented as a linear direction in activation space?” and, if so, “Can that direction be used to prevent jailbreaks from succeeding?” The difference-in-means technique was used to search for a direction in activation space that represents jailbreaks. Two interventions on the model's activations were then tested: activation addition and directional ablation. The activation addition intervention caused the attack success rate of jailbreaks to drop from 60% to 0%, suggesting that a direction representing jailbreaks might exist and that disabling it could make all jailbreaks unsuccessful. However, further research is needed to assess whether these findings generalize to novel jailbreaks. In addition, both interventions came at the cost of reduced helpfulness: they made the model refuse some harmless prompts.
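To make the three techniques named in the abstract concrete, here is a minimal sketch on synthetic data. It is not the project's code: the array names, the toy dimension `d_model`, the planted direction, and the choice of `coeff=-1.0` are all assumptions made for illustration, and the excerpt does not say which prompt sets were contrasted or which layers were modified. In a real setup the activations would come from a model's residual stream rather than a random generator.

```python
import numpy as np


def difference_in_means(jailbreak_acts, baseline_acts):
    """Candidate direction: mean jailbreak activation minus mean baseline activation."""
    return jailbreak_acts.mean(axis=0) - baseline_acts.mean(axis=0)


def activation_addition(acts, direction, coeff):
    """Shift every activation by coeff * direction (negative coeff moves away from it)."""
    return acts + coeff * direction


def directional_ablation(acts, direction):
    """Remove each activation's component along the (normalized) direction."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)


# Synthetic stand-ins for activations (assumption: one vector per prompt).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0  # hypothetical "jailbreak" direction planted for the demo

baseline_acts = rng.normal(size=(200, d_model))
jailbreak_acts = rng.normal(size=(200, d_model)) + 3.0 * planted

direction = difference_in_means(jailbreak_acts, baseline_acts)
unit = direction / np.linalg.norm(direction)

# Intervention 1: subtract the direction (sign chosen for this toy example).
steered = activation_addition(jailbreak_acts, direction, coeff=-1.0)

# Intervention 2: project the direction out entirely.
ablated = directional_ablation(jailbreak_acts, direction)

print("alignment with planted direction:", float(unit @ planted))
print("max |projection| after ablation:", float(np.abs(ablated @ unit).max()))
```

In this toy setting, subtracting the difference-in-means vector moves the mean of the jailbreak activations onto the baseline mean, while ablation leaves no component along the direction at all. Whether either effect translates into fewer successful jailbreaks is an empirical question about the model, which is what the project measures.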

Full project

You can view the full project here.

BlueDot Impact
