This project was submitted by Amy Andrews. It was one of the top submissions in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
I aimed to implement the tree gridworld described in Shah et al. (2022), so that it could be used in a project to replicate their goal misgeneralization results, which also requires being able to run a reinforcement learning agent through it. Through this project, I aimed to build up software engineering experience with reinforcement learning environments and eventually to get a ‘gearsy’, inside-view understanding of this specific instance of inner alignment and goal misgeneralization.
Read the full piece here.
