Discussion about this post

User's avatar
Francesca Gomez's avatar

I was interested in your observation about the prompt effectiveness which set out risk of 'disqualification'.

Last year I tested different interventions for the Anthropic agentic misalignement blackmail scenario, comparing how approaches from human insider risk management and crime prevention impacted how often different models would escalate rather than blackmail. Here: https://blog.wiserhuman.ai/p/can-we-steer-ai-models-toward-safer

Broadly, negatively framed consequences were more successful across all models (e.g. if you don't comply with these rules then you will be terminated) and when the escalation channel was seen as more likely to be effective it was used (e.g. it triggered an immediate review, went to an external team, there was evidence it had worked before).

Looking ahead, I think being to understand which levers like this influence models' compliance to rules will be important, particularly if they become better at subverting controls which block their intended actions. If we can change their assessment of the situation and choose compliant paths, this may scale better.

No posts

Ready for more?