This project was submitted by Emile Delcourt. It was one of the top submissions in our AI Alignment course (Mar 2024). Participants worked on these projects for 4 weeks.
The ability to discriminate risky from low-risk scenarios is one of the “hidden features” present in most later layers of Mistral-7B. I developed and analyzed linear probes on those layers’ hidden activations, and found strong evidence that the model generally knows when “something is up” with a proposed text versus a low-risk scenario (F1 > 0.85 for 4 layers; AUC in some layers exceeds 0.96). The top neurons activating in risky scenarios also have a security-oriented effect on outputs, with most increasing the probability of tokens like “Virus” or “Attack” and of words questioning “necessity” or likelihood. These findings give me confidence that we could develop LLM-based risk assessment systems (minus some signal/noise tradeoffs).
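As a rough illustration of the probing setup, the sketch below trains a logistic-regression probe on per-example activation vectors and reports F1 and AUC, the metrics quoted above. The activations here are synthetic stand-ins (random vectors with a small class-dependent shift along one direction), and the reduced dimensionality is an assumption for speed; in the actual project the vectors would come from Mistral-7B hidden states, which are 4096-dimensional.

```python
# Minimal sketch of a linear probe on hidden activations. Synthetic vectors
# stand in for real layer activations; in practice these would be extracted
# from a forward pass over labeled risky / low-risk prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 256  # reduced for the sketch; Mistral-7B's hidden size is 4096
n = 400

# Synthetic stand-in data: "risky" examples (label 1) are shifted by a fixed
# direction in activation space, mimicking a linearly readable hidden feature.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * direction

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0
)

# The "probe" is just a regularized linear classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = probe.predict(X_test)
score = probe.decision_function(X_test)
print(f"F1:  {f1_score(y_test, pred):.2f}")
print(f"AUC: {roc_auc_score(y_test, score):.2f}")
```

The learned weight vector (`probe.coef_`) doubles as a candidate “risk direction” in that layer, which is what makes linear probes useful for the neuron-level analysis described above.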
Read the full piece here.
