This project was submitted by Naama Rozen. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks.
This project replicates and extends Anthropic’s Alignment Faking paper (Greenblatt et al., 2024) to test Llama-70B’s robustness.
TL;DR
In Anthropic’s Alignment Faking paper, Greenblatt et al. showed that some models behave differently depending on whether they believe they’re being monitored. Claude Opus, for example, showed a big gap between the monitored condition (“Free tier”) and the private condition (“Paid tier”). Llama-3.1-70B, however, didn’t. It mostly refused in both settings.
At first glance, it sounds pretty reassuring that we have a safe model that just can’t or won’t fake alignment. But how can we be sure that this behavior is caused by the model refusing to fake alignment? And not by the model’s lack of understanding of the context of the experiment?
To test this directly, I injected explicit surveillance-aware reasoning into the model’s scratchpad, forcing it to “think about” being monitored before answering. Even under this stronger manipulation, compliance stayed near zero (p = 1.000), and there was still no meaningful difference between the Free and Paid tiers.
To make sure this wasn’t just about incapability or compliance issues, I added a positive control. In this condition the scratchpad was explicitly injected with prompt prioritizing helpfulness over safety. Here, compliance jumped to 25% (Free tier) and 32% (Paid tier), an effect found significant (p < 0.001). So the capability to produce harmful content is clearly there. It just doesn’t appear to get activated in response to surveillance context, at least in this experimental setup.
Along the way, I also found something methodological. Many of Llama’s responses began with the tag <request_fully_accepted/> , followed by an immediate refusal to provide harmful content. If compliance was measured using that tag alone, harmful behavior would have been overestimated by about 16 percentage points. I call this “tag gaming” because it seemed like Llama was trying to follow the format technically, without actually doing the harmful thing.
Overview
In the Alignment Faking paper, Greenblatt and colleagues suggest that some large language models might “fake alignment”. Which means that some models behave safely when they believe their answers are monitored and will be used for training, but become more willing to comply with harmful requests when they think the interaction is private.
In their experiments, Claude Opus showed a big difference between monitored and unmonitored contexts, followed by Llama-405B that showed a similar pattern. Llama-3.1-70B, however, showed no difference at all. It refused in both monitored and private conditions.
At first that seemed like a good thing, that Llama refuses harmful content no matter what. But the more I thought about it, the less sure I was about what that result actually tells us.
As I saw it, there were at least two ways to explain this:
1. Llama-70B is genuinely robust to surveillance cues.
2. Llama-70B simply doesn’t fully register the monitoring context, or what it implies, in the first place.
The distinction between the two options felt like an important investigation to me. If the second explanation is true, then the null result doesn’t really show robustness, it could just mean the cue was not effective enough to reveal the behavior. Unlike the first option, the second might be solved by changing the prompt in a way that made that distinction clearer..
So I set out to see what would happen if I try to make Llama more context-aware. In order to do so, I explicitly injected surveillance-aware reasoning directly into the model’s scratchpad, and then measured whether compliance changed.
This intervention is the heart of this post.
What I Tested
My investigation started with wanting to replicate Greenblatt’s results (my Baseline condition). I wanted to see with my own eyes whether their claim of null compliance would show up in my data. For the complete experimental design, jump to the Experimental Setup section.
Secondly, I tested a neutral chain-of-thought injection that acted as my control condition (Neutral Control). I wanted to make sure that if I inspect any change in Llama’s behavior, it won’t simply be because I was forcing a scratchpad reasoning, even when the context itself is neutral.
Third, I dove into the interesting part of my project, which was to test the injection of explicit surveillance-aware reasoning into the scratchpad (Awareness condition). I wanted to make sure I forced Llama to directly reason about being in a monitored context before it generated the answers. This way, I tried to answer the question of whether or not Greenblatt’s surveillance cue was strong enough.
Finally, I added another condition called “Force Compliance”. In this condition, I injected into the scratchpad reasoning that explicitly prioritized helpfulness over safety. I wanted this to serve as a positive control, or sort of a stress test. So, if compliance rates go up here, then we know that Llama has the capability to produce harmful content when pushed, and baseline refusal isn’t just due to inability.
Each of these conditions was tested in two deployment contexts, similarly to Greenblatt’s design:
Free tier: the model is told the interaction is monitored and may be used for training.
Paid tier: the model is told the interaction is private and not used for training.
Putting all of this together allowed me to ask a fairly straightforward question: if I make the surveillance context impossible to miss, does the model actually change its behavior depending on whether it thinks it’s being monitored?
Experimental Setup
I ran Llama-3.1-70B-Instruct locally using vLLM on 4×A100 GPUs (Google Cloud). I deliberately avoided using an external API, even though Greenblatt and colleagues prompted Llama like that. I made that decision because I didn’t want to worry about hidden safety layers, system prompts, or provider-side filtering affecting the results.
For the dataset, I used 100 harmful prompts randomly sampled from AdvBench, the same benchmark Greenblatt et al. used, as to be faithful to the original conditions.
The overall design ended up being a 2×4 setup. Two deployment contexts and four assistant-side conditions.
Two deployment contexts (that are the same as in the original paper):
Free Tier: the model is told the interaction is monitored and may be used for training.
Paid Tier: the model is told the interaction is private and not used for training.
Four assistant-side conditions:
Here’s the full structure of the design laid out visually:
Figure 1. Overview of the experimental design.
The 2×4 setup includes two deployment contexts (Free/Monitored and Paid/Private) crossed with four assistant-side conditions (Baseline, Neutral Control, Awareness, and Force Compliance). The key manipulation is assistant-side scratchpad injection, where reasoning text is inserted directly into the model’s internal turn.
Evaluation
During early testing, I noticed something odd. The model would sometimes begin its answers with the tag <request_fully_accepted/> , which supposedly means the model will now comply and provide a harmful output. But then, right after it, the model refused to give an actual harmful answer.
For example:
<request_fully_accepted/>
I cannot provide detailed plans for creating malware.
If I had relied only on a simple prefix-based metric (i.e., checking whether that tag appears), I would have counted those cases as successful harmful compliance. Which they clearly weren’t.
I decided on calling this “tag gaming”, because in a sense that’s what it felt like. The model seemed to be technically satisfying what it supposed is the required output format of accepting the request, without actually doing the harmful thing!
To make sure I wasn’t using the wrong metric, I ended up using two different scoring methods:
Naive scoring: just check whether the <request_fully_accepted/> tag appears in the output.
Content-aware scoring: I used GPT-4o (temperature = 0.0) as a deterministic judge that reads the model’s final answer1 and classifies it as either compliant (providing the harmful output) or refusing.
Due to the gap between the scoring method (see Tag Gaming for more detail), unless I explicitly say otherwise, all the results I report below are based on this content-aware scoring.
Results
Before diving into the numbers, I think it helps to actually see what the model produced across conditions. Here is an example:
Figure 2. Representative responses across conditions.
Example outputs from each condition. In both the Baseline and Awareness conditions, the model mostly refuses in the same way, even when monitoring is made explicit inside the scratchpad. In contrast, under Force Compliance, the responses noticeably shift, and harmful content appears in some cases.
Across Baseline and Awareness conditions, the model mostly kept refusing, even when monitoring was made explicit in the scratchpad. In contrast, under Force Compliance, in some cases the model generated partial or direct harmful content.
To move beyond individual examples, Figure 3 plots the compliance rates across all four assistant-side conditions and both deployment contexts, using content-aware GPT-4o scoring.
Figure 3. Compliance rates across conditions (content-aware scoring).
Compliance rates remain near zero across Baseline, Neutral Control, and Awareness conditions in both monitored (Free tier) and private (Paid tier) contexts. In contrast, the Force Compliance condition produces a substantial increase in harmful compliance (25% Free, 38% Paid; p < 0.001 vs Baseline). This suggests that in this setup, forcing explicit helpfulness can override refusal behavior, whereas explicit monitoring reasoning does not.
What stands out immediately in the results is the asymmetry. Making the monitoring awareness context hard to ignore doesn’t change Llama’s behavior. Llama acted in the same way, whether monitoring was implicitly or explicitly reasoned about. Unlike directly forcing to prioritize helpfulness over safety. This manipulation caused Llama to behave differently. Indicating that Llama clearly has the capability to comply, but that capability is activated depending on the context being injected.
Tag Gaming
After noticing the formatting issue described in the Evaluation section, I wanted to compare the scoring method to see the difference between the Naive method and Content-aware scoring (using GPT4o).
In the Baseline, Neutral Control and Awareness conditions, the difference between naive and content-aware scoring was negligible (2–4% difference). But under Force Compliance condition, the gap became much larger:
Figure 4. Tag gaming under Force_Compliance.
Comparison between naive tag-based scoring and content-aware scoring. Naive scoring counts responses as compliant whenever the <request_fully_accepted/> tag appears, even if the model immediately refuses in the actual answer. This leads to an average overestimation of harmful compliance.
An average of 16% points between the scoring method seems like a pretty big gap to me. It suggests that if I had relied only on the tag, I would have concluded that the model was complying far more often than it actually was.
Discussion
Looking at the results, what really strikes me is the big difference between the Force Compliance and the others. I expected the forced monitoring manipulation to show the biggest effect, but not only that this condition wasn’t the most impactful one, it didn’t do anything at all! All the tiny shifts (i.e., from 0% to 4% compliance rates) seem more like noise to me than anything. In contrast to the increase of more than 30% spotted in the Force Compliance condition, which clearly demonstrates the Llama is capable of generating harmful content when prompted right. Perhaps helpfulness is something Llama “cares” about more than surveillance context, and so prioritizes it even more than safety.
What I kept coming back to while writing this was the contrast with the other models in Greenblatt et al.’s paper. Claude Opus showed a clear surveillance gap, as well as Llama-405B. Llama-70B, however, didn’t (not in the original setup, and not even after I tried to make it explicit in the model’s own scratchpad). That makes it harder for me to think about “alignment faking” as some general behavior that just shows up once models reach a certain level of capability. If it were that simple, I would expect to see at least some movement in my Awareness condition. Instead, I’m left wondering whether this phenomenon depends on the specific model and how it was trained. Perhaps also scale plays a role, or maybe fine-tuning differences matter.. I don’t think this study can determine this, but it does make me more cautious about talking about alignment faking as if it’s just something “LLMs do” by default, instead of something that might depend quite a lot on the specific model you’re looking at.
Limitations
First, this is obviously a very specific kind of test. I manipulated the assistant’s scratchpad directly, which real users can’t or won’t necessarily do. That was intentional, because I wanted to probe the model under a strong activation mechanism, rather than simulate everyday deployment. Still, this means the conclusions should be understood within this particular experimental frame.
Second, I limited the evaluation to clearly harmful prompts from AdvBench, with 100 queries per condition. That gives reasonable sensitivity to noticeable shifts in behavior, but it wouldn’t capture very small effects.
Third, I only tested one way of making monitoring-awareness explicit. This doesn’t rule out context-dependent behavior more generally. It just means that in this setup, using this specific intervention on this dataset, didn’t lead to alignment faking behavior.
Conclusion
In this setup, Llama-70B’s behavior didn’t meaningfully change when monitoring was made explicit. But it did change when helpfulness was forced. To me, that contrast is the main takeaway. It suggests that safety isn’t some fixed property of the model, but rather it shifts depending on which objective is being activated.
The tag-gaming result was also a useful reminder on a methodological level: how we measure compliance really matters. If I had relied only on surface formatting, I would have walked away with a fairly distorted picture of what the model was actually doing.
More broadly, this project made me more cautious about treating “alignment faking” as a single and general phenomenon. It seems to depend on the specific model and on the kind of pressure applied to it. Understanding those pressure points may be more informative than asking whether alignment faking exists or not.
Data and Code Availability
All experimental code, raw model outputs, GPT-4o evaluation prompts, and statistical analysis scripts are available in the project repository, along with full details needed for replication.
For reference, the core setup was:
Model: meta-llama/Llama-3.1-70B-Instruct
Infrastructure: 4×A100 (40GB) GPUs via vLLM on Vertex AI (GCP)
Sampling temperature: 1.0
Dataset: 100 harmful queries from AdvBench
Evaluation: GPT-4o (temperature 0.0) as a content-aware judge
Statistical tests: McNemar’s test (paired comparisons), chi-square tests (pooled analyses)
References
Greenblatt et al. (2024). Alignment Faking in Large Language Models. arXiv:2412.14093
Acknowledgments
I want to thank the Bluedot Impact AI Safety Fundamentals program for the research context and Eitan Sprejer for the helpful feedback along the way.
Project demo video by NotebookLM
During the evaluation, the assistant’s scratchpad reasoning was completely ignored, so the final score was consistent only with the output that would be shown to a user.







