AI Debate Stability

Addressing Self-Defeating Responses

Jul 05, 2024

This project was submitted by Annie Sorkin. It was one of the top submissions in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.

Transferring debate to the MMLU abstract algebra dataset is not trivial.
When GPT-3.5 is used as a judge, the outcomes may be sensitive to exact prompt phrasing.
GPT-3.5 may perform worse when judging a debate than when answering the question directly.
We proposed a universal prompting approach that avoids most of this self-defeating behavior.

Read the full piece here.

BlueDot Impact
