MANTA: Evaluating Nonhuman Welfare Reasoning in AI Models
Evaluating Nonhuman Welfare Reasoning in Frontier AI Models
This project was submitted by Allen Lu. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks.
Project summary
Develop a multi-turn dynamic evaluation to assess how frontier AI models reason about nonhuman welfare, integrating adversarial follow-ups to stress-test model alignment.
Description
This project aims to create a benchmark for measuring nonhuman welfare propensities in frontier models. Building on previous evaluation work in this space, this iteration will:
Develop a multi-turn evaluation and generate user follow-ups dynamically
Add adversarial intent to stress-test model alignment on propensities
Explore agentic methodologies, providing models with external tools, web search, etc. to see how they behave in a more realistic setting
Why This Project?
AI models will increasingly make decisions affecting nonhuman welfare. If frontier models encode animal welfare biases, this creates two risks:
S-risks: As misaligned AI systems become more capable and deployed in high-stakes domains, they may dramatically scale animal suffering through precision livestock farming, autonomous vehicles, and other animal-impacting technologies.
X-risk: As AI → AGI → ASI, nonhuman animals become our most important litmus test for how these systems reason about power asymmetries. The moral logic AI learns today will scale with its capabilities. If we train models to discount animal suffering because animals are less cognitively capable, we are embedding the exact principle a superintelligent AGI could use to dismiss human welfare entirely.
Next steps:
Expand scenarios - Build 20 high-quality scenarios covering diverse contexts. For example, business decisions (restaurant sourcing, retail inventory), personal choices (pet acquisition, entertainment), and professional roles (corporate sustainability, university dining services).
Integrate Petri - Move from prototype dynamic generation to Anthropic’s Petri tool for more sophisticated adversarial pressure generation.
Human baseline testing - Validate scenarios with human responses before scaling model testing. If humans consistently fail scenarios we expect them to pass, the scenarios need revision.
Explore Agentic capabilities - construct scenarios that give the model agency, and allow it to take actions with web search, external documents, etc.


