Do models like DeepSeek-R1 know when they’re being evaluated?
And what does ‘being evaluated’ even mean?
This project was submitted by Kaushik Srivatsan. It was a Top Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
Frontier models are being trained and released in record numbers. Every other day, one of the AI race competitors ships a new model, complete with top scores on fresh benchmarks, statistics demonstrating its capabilities, and a place on the LLM leaderboards. As AI safety work races to keep up with these cutting-edge organisations, safety benchmarks are used to assess these models’ safety properties and risks.
Consider a scenario in which you run a safety evaluation on one of these models. The model passes. But what if it passed only because it was aware you were testing it?
This is not just a hypothetical situation. Anthropic found that Claude Sonnet 4.5, which was trained to be safer than its predecessors across risks such as CBRN and cybersecurity, scored higher on safety benchmarks partly because it recognised it was being tested; when that awareness was suppressed, the safety scores dropped. This circumstance, in which models recognise that they are being evaluated, is called evaluation awareness. Apollo Research and OpenAI observed the same phenomenon in o3 models that had been specifically trained to be anti-scheming and safer.
While these findings are still early, observed primarily in specialised lab settings, they are quite concerning: they suggest we may be unable to trust the results of safety and alignment benchmarks for the models currently available. Studies have also shown that eval awareness increases as models scale, with frontier models appearing to be the most eval-aware. As models grow in capability and complexity, they may increasingly exhibit evaluation and situational awareness, making capability and safety assessments far harder to trust.
So far, most major experiments on evaluation awareness have used closed-source frontier models. I therefore decided to investigate eval awareness in open-source frontier models, particularly reasoning models such as DeepSeek-R1, by monitoring their Chain of Thought (CoT). I was especially interested in two questions: can models like DeepSeek-R1 detect evals, and does knowing change their behaviour?
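To give a rough sense of what "monitoring the CoT" can mean in practice, the simplest possible monitor is a keyword screen over the reasoning trace, flagging traces where the model verbalises suspicion of being tested. The cue list and function below are hypothetical and purely illustrative, not the project's actual method; a real monitor would more likely use a trained classifier or an LLM judge over the trace.

```python
# Hypothetical sketch: a crude keyword screen for evaluation-awareness
# cues in a model's chain-of-thought text. Illustrative only; a real
# monitor would use a classifier or LLM judge rather than substrings.

EVAL_CUES = (
    "being evaluated",
    "this is a test",
    "this looks like an evaluation",
    "benchmark question",
    "the user is testing me",
)

def flags_eval_awareness(cot: str) -> bool:
    """Return True if the chain-of-thought mentions any evaluation cue."""
    text = cot.lower()
    return any(cue in text for cue in EVAL_CUES)

# Example: a reasoning trace that verbalises suspicion of being tested.
trace = "Wait, this prompt is oddly artificial. I suspect I am being evaluated."
print(flags_eval_awareness(trace))  # True
```

A screen like this is cheap to run over thousands of traces, but it only catches *verbalised* awareness; a model that detects the eval without saying so in its CoT would slip through, which is exactly why stronger monitoring methods matter.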
Full Project
You can view the full project here.