This project was submitted by Sunishchal Dev. It was one of the top submissions in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
Model-written evaluations for AI safety benchmarking differ from human-written ones in ways that bias how LLMs respond: they suffer from problems with structure, formatting, and hallucination, and often exhibit distinctive semantic styles. These differences raise the risk of false negatives, where a lab could unwittingly deploy an unsafe model because its evals failed to surface the problem. We propose a suite of QA checks to catch basic issues in model-written evals, along with a research agenda aimed at understanding and rectifying the full extent of the differences from the human-written gold standard.
Read the full piece here.
