This project was submitted by Jessica Nunez. It was one of the top submissions in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
In machine learning, understanding the developmental trajectories of neural networks is crucial if we ever want to deem a model safe, or at least as close to safe as we can get. The task is extremely complex, however, and few success stories have emerged from neural network interpretability. The field is so new that there isn't a go-to textbook I could use to learn these ideas; only esoteric research papers and a handful of YouTube videos. I say this not to discourage, but to emphasize how low-key the subfield is and how much room it has for growth and impact!

The core of this analysis draws inspiration from developmental biology, where critical periods mark significant changes in neural architecture; we aim to uncover analogous phases in the training of artificial neural networks. Specifically, we investigate a toy model of superposition (TMS), a simplified neural network that lets us observe fundamental learning dynamics without the obfuscating intricacies of larger models.
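To give a feel for how small such a model is, here is a minimal sketch of the standard TMS setup: n sparse features are squeezed through a bottleneck of m < n hidden dimensions and reconstructed as ReLU(WᵀWx + b). The dimensions and weights below are illustrative choices, not values from the project itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, m_hidden = 5, 2  # illustrative: 5 features, 2 hidden dims

W = rng.normal(size=(m_hidden, n_features))  # shared encode/decode weights
b = np.zeros(n_features)                     # per-feature output bias

def tms_forward(x):
    """Reconstruct a sparse input x through the bottleneck."""
    h = W @ x                          # compress: R^n -> R^m
    return np.maximum(W.T @ h + b, 0)  # decompress, then ReLU

# Inputs are sparse: most features are zero most of the time.
x = np.array([0.0, 0.7, 0.0, 0.0, 0.3])
x_hat = tms_forward(x)
print(x_hat.shape)  # (5,)
```

Training such a model to minimize reconstruction error, and watching when and how features get packed into the bottleneck, is what makes TMS a convenient sandbox for studying learning dynamics.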
Read the full piece here — https://github.com/jes-nunez/Experimenting-with-Developmental-Interpretability/blob/main/Dev%20Interp%20in%20TMS%20-%20Jessica%20Nunez.pdf
