This project was submitted by Michael Schmatz. It won the ‘AI Governance’ prize in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks. The text below is an excerpt from the final project.
The UK AI Safety Institute recently released the Inspect framework for writing evaluations. It provides a number of tools and abstractions that make evaluations straightforward to build. In this post, we’ll write a simple evaluation with Inspect. If you find this topic interesting, I encourage you to try using Inspect to write your own!
We’re going to reproduce the core of the paper “Large Language Models Can Strategically Deceive Their Users When Put Under Pressure” by Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn at Apollo Research. Jérémy was kind enough to provide two classification prompts not present in the public project repository. Thanks Jérémy!
The original GitHub repo with classification prompts and raw results is available here.
Read the full piece here.
