This project was submitted by Dmitrii Volkov. It was a runner-up for the ‘Novel Research’ prize in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
We show that extensive LLM safety fine-tuning is easily removed when an attacker has access to model weights. Our primary contributions are to:
Update the BadLlama [GLRSL24] result to Llama 3;
Improve on their result, cutting the cost of removing guardrails from tens of GPU-hours and $200 to 30 minutes and $0 (one common way to do this cheaply is sketched after this list);
Use robust standard benchmarks; and
Release reproducible evaluations.
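
For intuition on why this attack is so cheap, here is a minimal sketch of parameter-efficient (QLoRA-style) fine-tuning with Hugging Face transformers and peft, one common way to strip safety behavior from an open-weight model on a single consumer GPU. The technique choice, target modules, and hyperparameters are illustrative assumptions on our part, not the paper's exact recipe; read the full piece for the actual method.

```python
# Illustrative sketch only: QLoRA-style fine-tuning of an open-weight model.
# Hyperparameters and target modules are assumptions, not the paper's recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the base model in 4-bit so it fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train small low-rank adapters instead of the full weights:
# minutes of compute rather than GPU-days.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters

# A standard supervised fine-tuning loop on a small dataset of
# guardrail-free completions would follow here (omitted).
```

Because only a tiny fraction of the weights are trained, the adapter converges quickly on a small dataset, which is what drives the cost down to the "30 minutes and $0" regime (e.g. on free-tier cloud GPUs).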
Read the full piece here.
