Using an SAE as a steering vector
Submitted by Nelson Gardner-Challis on our AI Alignment (March 2024) course
This project won the ‘Education and community building’ prize on our AI Alignment (March 2024) course.
Full Summary
Over the past 12 weeks, I participated in the AI Safety Fundamentals Alignment course and completed a capstone project focused on AI alignment. My project explored concepts from the field of Mechanistic Interpretability. I created a tutorial for the SAE Lens GitHub repository that teaches you how to use a Sparse Autoencoder (SAE) to create a steering vector and use it to influence a model’s generated responses.
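To give a rough sense of the idea, here is a minimal toy sketch in plain Python. It is not the tutorial's code, which works with SAE Lens and a real language model. It assumes one common approach to steering: take the decoder direction of a single SAE feature, scale it by a coefficient, and add it to the model's activations at every token position. The function names, the decoder weights, and the activation values are all made up for illustration.

```python
# Toy illustration only: a real SAE's decoder is learned from model
# activations, and real steering is applied inside the model during generation.


def steering_vector(decoder_weights, feature_index, coefficient):
    """Scale one SAE decoder direction to get a steering vector."""
    direction = decoder_weights[feature_index]
    return [coefficient * x for x in direction]


def steer(activations, vector):
    """Add the steering vector to the activation at every token position."""
    return [[a + v for a, v in zip(position, vector)] for position in activations]


# Hypothetical decoder: 3 features, model width 4 (made-up numbers).
decoder_weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Hypothetical activations for 2 token positions (made-up numbers).
activations = [
    [0.5, 0.25, -0.5, 1.0],
    [0.0, 1.0, 0.0, 1.0],
]

# Steer towards feature 1 with a hypothetical strength of 4.0.
vector = steering_vector(decoder_weights, feature_index=1, coefficient=4.0)
steered = steer(activations, vector)
print(steered)
```

In this sketch, the coefficient controls how strongly the chosen feature is pushed: a larger value moves the activations further along that feature's direction.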
I had two main aims while completing this project:
Deepen my Experience: I chose a topic I found intriguing during the course and aimed to learn more about it through hands-on work.
Create a Public Good: I wanted to contribute something useful for others interested in AI safety, like a tutorial that could aid their own learning.
Key Outcome:
The tutorial is available as part of the SAE Lens tutorial list.
<Embed url=”https://storage.k8s.bluedot.org/website-assets/migrated/4d833efd-820c-4fad-91fd-57d4cbd89278.jpeg” />
A steered response to the general prompt of ‘What is on your mind?’
Read the full piece here.
