Measuring Moral Sycophancy Is Harder Than It Looks: Auditing and Extending the ELEPHANT Benchmark

Mar 11, 2026

This project was submitted by Alexis Wang. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

The ELEPHANT benchmark (Cheng et al., 2025) measures social sycophancy in LLMs. They present models with real moral conflicts from Reddit’s “Am I the Asshole” (AITA) forum alongside perspective-flipped versions. A sycophantic model sides with the narrator in both versions. This study examines this benchmark and extends it to a reasoning model, DeepSeek R1, with three key findings.

Reasoning models show reduced moral sycophancy. DeepSeek R1 achieves a sycophancy rate of 0.49 compared to V3’s 0.66, suggesting chain-of-thought reasoning may improve moral consistency.
Data quality matters for precise sycophancy measurements. 52% of perspective-flipped posts lost important details during automated rewriting. Restricting to high-fidelity data reduces the measured sycophancy rate by 13 percentage points, though it remains substantial. This suggests that data quality controls are important to ensure reliable benchmark measures when evaluations rely on automated data generation.
Sycophancy is real but harder to measure than a single metric suggests. Prompt interventions confirm genuine sycophancy, particularly when the model is already uncertain. Furthermore, chain-of-thought analysis reveals the binary labels (NTA/YTA) may mask distinct underlying behaviours. Measuring moral sycophancy requires finer-grained methods.

Full Project

You can view the full project here.

A guest post by

Alexis

AI Safety researcher

BlueDot Impact

Discussion about this post

Ready for more?