This project was submitted by Nelson Gardner-Challis. It won the ‘Education and Community Building’ prize in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
Full Summary
Over the past 12 weeks I have participated in the AI Safety Fundamentals Alignment course and completed an AI alignment-focused capstone project. My project explored concepts from the field of mechanistic interpretability: I created a tutorial for the SAELens GitHub repository that shows how to use a sparse autoencoder (SAE) to build a steering vector and influence a model's generated responses.
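The core idea is compact enough to sketch. Roughly: load a pretrained SAE with SAELens, take the decoder direction of a chosen feature as the steering vector, and add a scaled copy of it to the model's residual stream via a TransformerLens hook while generating text. The sketch below follows that recipe under stated assumptions; the release name, feature index, and steering coefficient are illustrative, not the tutorial's exact values.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load GPT-2 small and a pretrained residual-stream SAE for layer 8.
# (Release and hook point are example values from SAELens' pretrained SAEs.)
model = HookedTransformer.from_pretrained("gpt2", device=device)
hook_name = "blocks.8.hook_resid_pre"
sae, cfg_dict, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id=hook_name,
    device=device,
)

FEATURE_IDX = 10200  # hypothetical feature, chosen by inspecting activations
COEFF = 8.0          # steering strength; tune by sampling and reading outputs

# The steering vector is the SAE decoder direction for the chosen feature.
steering_vector = sae.W_dec[FEATURE_IDX]

def steering_hook(resid, hook):
    # Add the scaled feature direction to the residual stream at every position.
    return resid + COEFF * steering_vector

prompt = "What is on your mind?"
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    tokens = model.to_tokens(prompt)
    output = model.generate(tokens, max_new_tokens=60, temperature=0.7)

print(model.to_string(output[0]))
```

With the hook active, generations are pulled toward whatever concept the chosen feature represents; with it removed, the model behaves normally, which makes the effect easy to compare side by side.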
I had two main aims while completing this project:
Deepen my experience: I chose a topic I found intriguing during the course and aimed to learn more about it through direct, hands-on work.
Create a public good: I wanted to contribute something useful for others interested in AI safety, such as a tutorial that could support their own learning.
Key Outcome:
The tutorial is available as part of the SAELens tutorial list.
*A steered response to the general prompt ‘What is on your mind?’*
Read the full piece here.

