This project was submitted by Nelson Gardner-Challis. It won the ‘Education and Community Building’ prize in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
Full Summary
Over the past 12 weeks I have participated in the AI Safety Fundamentals Alignment course and completed an AI alignment-focused capstone project. My project explored concepts from the field of mechanistic interpretability: I created a tutorial for the SAELens GitHub repository that shows how to use a sparse autoencoder (SAE) to build a steering vector and influence a model's generated responses.
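The core idea is compact enough to sketch. Roughly: load a pretrained SAE with SAELens, take the decoder direction of a chosen feature as the steering vector, and add a scaled copy of it to the model's residual stream via a TransformerLens hook while generating text. The sketch below follows that recipe under stated assumptions; the release name, feature index, and steering coefficient are illustrative, not the tutorial's exact values.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load GPT-2 small and a pretrained residual-stream SAE for layer 8.
# (Release and hook point are example values from SAELens' pretrained SAEs.)
model = HookedTransformer.from_pretrained("gpt2", device=device)
hook_name = "blocks.8.hook_resid_pre"
sae, cfg_dict, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id=hook_name,
    device=device,
)

FEATURE_IDX = 10200  # hypothetical feature, chosen by inspecting activations
COEFF = 8.0          # steering strength; tune by sampling and reading outputs

# The steering vector is the SAE decoder direction for the chosen feature.
steering_vector = sae.W_dec[FEATURE_IDX]

def steering_hook(resid, hook):
    # Add the scaled feature direction to the residual stream at every position.
    return resid + COEFF * steering_vector

prompt = "What is on your mind?"
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    tokens = model.to_tokens(prompt)
    output = model.generate(tokens, max_new_tokens=60, temperature=0.7)

print(model.to_string(output[0]))
```

With the hook active, generations are pulled toward whatever concept the chosen feature represents; with it removed, the model behaves normally, which makes the effect easy to compare side by side.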
I had two main aims while completing this project:
Deepen my experience: I chose a topic I found intriguing during the course and aimed to learn more about it through direct, hands-on work.
Create a public good: I wanted to contribute something useful for others interested in AI safety, such as a tutorial that could support their own learning.
Key Outcome:
The tutorial is available as part of the SAELens tutorial list.
*A steered response to the general prompt ‘What is on your mind?’*
Read the full piece here.

