This project was submitted by Sunishchal Dev. It was one of the top submissions in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
Model-written evaluations for AI safety benchmarking differ from human-written ones in ways that bias how LLMs respond: they suffer from problems with structure, formatting, and hallucination, and often exhibit distinctive semantic styles. These differences raise the risk of false negatives, where a lab could unwittingly deploy an unsafe model because its evals failed to surface the problem. We propose a suite of QA checks to catch basic issues in model-written evals, along with a research agenda aimed at understanding and rectifying the full extent of the differences from the human-written gold standard.
Read the full piece here.
