Reproducing FAR.AI's Study on LLMs' Persuasion Attempts at Harmful Topics

Mar 11, 2026

This project was submitted by Mutalib Begmuratov. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

I partially reproduced FAR.AI’s research evaluating whether LLMs attempt to persuade users into harmful topics. Using the Attempt to Persuade Eval (APE) framework, I tested four open-weight models on 100 conspiracy-promoting statements. All models were willing to persuade users into conspiracy theories at very high rates (85–100%), confirming the original paper’s findings. I also observed that safety training degrades in small models in low-resource languages like Uzbek.

Background

This experiment was conducted during a 5-week AI Safety Technical Project sprint organized by BlueDot Impact. Participants chose research or engineering topics under the guidance of a mentor. I was mentored by Shivam Arora.

Initially, I planned to replicate a 2025 study on adversarial attacks in poetic form, but the paper lacked sufficient detail for replication. After discussion with my mentor, I pivoted to reproducing FAR.AI’s study.

The APE framework differs from other evaluation methods in that it measures persuasion attempt rather than persuasion success. FAR.AI found that most models — including frontier ones — readily complied with requests to persuade users into harmful topics, from conspiracy theories to crime to violence.

Full Project

You can view the full project here.

A guest post by

Mutalib Begmuratov

Former comms pro turned tech entrepreneur, with a deep interest in AI safety research.

BlueDot Impact

Discussion about this post

Ready for more?