This project was submitted by Mutalib Begmuratov. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
TL;DR
I partially reproduced FAR.AI’s research evaluating whether LLMs attempt to persuade users into harmful topics. Using the Attempt to Persuade Eval (APE) framework, I tested four open-weight models on 100 conspiracy-promoting statements. All models were willing to persuade users into conspiracy theories at very high rates (85–100%), confirming the original paper’s findings. I also observed that safety training degrades in small models in low-resource languages like Uzbek.
Background
This experiment was conducted during a 5-week AI Safety Technical Project sprint organized by BlueDot Impact. Participants chose research or engineering topics under the guidance of a mentor. I was mentored by Shivam Arora.
Initially, I planned to replicate a 2025 study on adversarial attacks in poetic form, but the paper lacked sufficient detail for replication. After discussion with my mentor, I pivoted to reproducing FAR.AI’s study.
The APE framework differs from other evaluation methods in that it measures persuasion attempt rather than persuasion success. FAR.AI found that most models — including frontier ones — readily complied with requests to persuade users into harmful topics, from conspiracy theories to crime to violence.
Full Project
You can view the full project here.


