BlueDot Impact

Is the AGI strategy course right for you?

Li-Lian Ang — Sun, 26 Apr 2026 16:19:52 GMT

The AGI strategy course materials are freely available online, so why apply rather than go through the material yourself?

What you’re applying for is access to an 8,000+ (and growing!) community who are serious about contributing to AI safety. We also provide programs like career transition grants, rapid grants and 1-1 advising to support them.

The AGI strategy course builds your understanding of advanced AI systems — what they can do, what drives progress, what the concrete risks are, and who controls development — so you can make informed decisions about how to contribute.

We reject more than half of applications to the course. This post helps you figure out if you should spend time on an application.

What the AGI strategy course is NOT

It’s not a corporate responsible AI course

If your goal is ISO 42001 compliance, responsible AI frameworks for your product team, or helping your organisation navigate the EU AI Act, you should look into something like IAPP’s AIGP certification or ISO 42001 programs instead.

It’s not an AI ethics course

We don’t cover algorithmic bias in hiring tools, fairness metrics, or the full range of social impacts of current AI systems. If that’s your focus, you’re better served by something like this or this.

It’s not for passive learners

You’re expected to actively engage with others who are motivated to make the future go well. If you haven’t started thinking seriously about risks from advanced AI, take the 2-hour, self-paced Future of AI course first.

What gets you in

It’s not (just) your CV. We’ve rejected plenty of people from prestigious institutions with impressive resumes who wrote three-word applications, and we’ve also accepted sharp people with more modest titles who showed they were already sprinting ahead.

We’re looking for people who are already taking action: applying to roles, going to events, building things, writing, talking to people in the field. You don’t need a perfect plan, but something more specific than “apply for fellowships”. This is a line we see in dozens of applications every round.

Your application itself is evidence for us too. We use it to infer how you’ll show up in a live discussion session of 6-8 people working through hard material together. Effort, agency and clarity go a long way.

Before you apply, ask yourself this:

Am I focused on frontier AI, or AI in general?
Do I have a specific reason to take this course now?
Will I act on this fully within six months?

If you’ve read this and you’re thinking: ”yes, this is me”, apply here.

We’ve given out $50,000 in rapid grants. Now we want to triple that.

Joshua Landes — Tue, 14 Apr 2026 20:44:49 GMT

Over the past few months, BlueDot Impact’s Rapid Grants program has quietly funded 77 rapid grants totaling over $50,000 - mostly to course participants and facilitators working on technical AI safety research. Grants are small, decisions are fast, and the process is lightweight.

That program worked. So we’re making it bigger.

The majority of our grants have gone toward compute and API credits for people running evals, training small models, and replicating safety research, typically for a few hundred dollars. Our Technical AI Safety Projects Sprint has been great at surfacing these.

But some of our most impactful grants didn’t look like that at all:

AI Safety Poland received $4,800 to organize meetups across the country, covering venue and tooling costs, and building a national community from scratch. They’ve had hundreds of attendees, and dozens of people they’ve referred have ended up taking a BlueDot course.
Justin Dollman received $3,000 to lead our weekly evals reading group, coordinating a growing community of people learning about and working on model evaluations. We’ve since funded more reading groups on AI governance and AI x cyber and are excited about scaling these up from ~120 to 300 attendees per week.
Sally Gao received $1,000 to run AI Safety meetups in New York, hosting guests such as Alex Bores. Her next event on the State of AI Safety is coming up on April 23rd!
Eitán Sprejer received $4,200 in facilitator stipends to run our Technical AI Safety Projects Sprint at BAISH in Buenos Aires, accessing a new and underleveraged talent pool for AI safety work.
Aaron Maiwald received $5,000 to attend a biosecurity conference in DC, travel to SF, and connect with senior folks in the field to accelerate his journey.
Zac Saber received $8,000 to drop out of EF and validate an AI safety-focused startup instead.

None of these grants fit neatly into “compute for a project.” But they were some of our highest-impact bets. As the program picked up speed and more of these came in, we started treating rapid grants as small, focused bets on talented people doing or experimenting with high-impact projects - research, fieldbuilding, talent acceleration, you name it. In some cases, we approached people directly and pitched them on applying for work they were already doing.

Going forward, we’re expanding the official scope to match our internal understanding: Rapid Grants now fund much more than compute - we’re excited to back work on events, talent acceleration, BlueDot community building, and more. The bar for funding hasn’t changed - if anything we’ve raised it as we became more calibrated - we still look for concrete work in progress, a specific cost that’s the bottleneck, and a reason for us to believe the work matters for making AI go well.

Grant sizes now go up to $10,000 to allow us to make bigger bets for more impact. Decisions will still be fast - we’re targeting around five working days. For grants above $5,000, we may hop on a quick call.

Who should apply

If you’re in the BlueDot community - a course participant, alum, facilitator, or wider community member - and you’re doing something high-impact in one of our focus areas that a grant would accelerate, apply.

If you’re on the fence, the default, as always, is simple: apply.

Apply and see the full list of public grantees and program details at bluedot.org/grants/rapid.

Is the Frontier AI Governance course right for you?

Joshua Landes — Sun, 05 Apr 2026 23:46:13 GMT

We currently reject about 75% of applicants to the Frontier AI Governance course. Most of them aren’t bad candidates - in our view they are applying to the wrong course.

This short blog post is intended to help you figure out if you should spend time on an application.

FAIGC is at its core about the governance of frontier AI systems - AGI, ASI - the most capable models being built by a handful of labs, the decisions governments and other actors need to make as those systems become more powerful, and what happens as general capabilities rapidly approach and exceed human-level.

If your guiding question is “how should we govern AI systems that might be smarter than humans within the next few years” you’re in the right place!

What FAIGC is not

It is not a corporate AI governance course

If your goal is ISO 42001 compliance, responsible AI frameworks for your product team, or helping your organization navigate the EU AI Act - that’s real work and we respect it, but it’s not what we focus on. You should look at something like IAPP’s AIGP certification or ISO 42001 programs instead.

It is not an AI ethics course

We don’t cover algorithmic bias in hiring tools, fairness metrics, or the full range of social impacts of current AI systems. If that’s your focus, you’re better served by something like this or this.

It is not introductory

We require the AGI Strategy course or some equivalent background. If you’re still building your understanding of why this matters, apply to the AGISC first. It too is free and runs every month!

What gets you in

It’s not (just) your CV - we’ve rejected plenty of people from prestigious institutions with impressive resumes who wrote three-word applications, and we’ve also accepted sharp people with more modest titles who showed they were already sprinting ahead.

What we’re looking for: you’ve already started engaging with frontier AI governance specifically. You can name a concrete gap between where you are and what the course provides. And your post-course plan better be more specific than “apply for fellowships” - a line we see in dozens of applications every round.

Your application itself is evidence for us too. We use it to infer how you’ll show up in a live discussion session of seven to nine people working through hard material together. Effort, agency and clarity go a long way.

A note for AGI Strategy course graduates

We view completing our AGI Strategy course as a prerequisite and not as a ticket! We may still reject AGI Strategy grads who apply. If you’ve completed the AGISC and then started building something, writing something, or working on something in frontier AI governance, tell us about that. But if you completed it and your main next step is taking this course, you may need more time.

As a general heuristic, before you apply, ask yourself these three things

Am I focused on frontier AI, or AI in general?
Do I have a specific reason to take this course now?
Will I act on this fully within six months?

We’re building a pipeline into the institutions that govern frontier AI and are looking for people who have a good chance to be in those institutions - not people adding a line to their CV.

If you’ve read this and you’re thinking yes, this is me - apply here.

Quitting FinTech for AI Safety — Milos Borenovic

Harrison Wood — Wed, 25 Mar 2026 14:38:00 GMT

Watch on YouTube; listen on Apple Podcasts or Spotify.

High-Stakes Activation Probes on Indonesian

Ivan — Wed, 25 Mar 2026 11:19:03 GMT

This project was submitted by Ivan Wiryadi. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

Abstract

McKenzie et al. trained linear probes on language model activations to detect high-stakes interactions. We test whether these probes generalize to a language absent from training by translating the evaluation datasets into Indonesian and evaluating without retraining. On Llama 3.1 8B (layer 12), mean AUROC on out-of-distribution benchmarks drops from 0.9046 to 0.8957 (0.9% degradation, 10- seed mean). On Gemma 3 12B (layer 32), the gap is 0.8972 to 0.8873 (1.0%). We then use Sparse Autoencoders (SAEs) to identify features enriched in the probe’s error cases. On Llama, differential analysis identifies features with 10–15% higher prevalence in errors than in correct predictions, loosely aligned with technicalinfrastructure and financial topics. On Gemma, the Gemma SAE produces no diagnostic signal. The cross-lingual degradation is small relative to the out-ofdistribution gap between synthetic training data and naturalistic benchmarks (6– 13%). The out-of-distribution gap is the more pressing concern for deployment.

Full Project

You can view the full project here.

Interpreting Latent Reasoning in the Depth-Recurrent Transformer

Tue, 24 Mar 2026 14:30:53 GMT

This project was submitted by Tristan von Busch. It was a Top Submission for the “Technical AI Safety Project” Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

Independently reproduced architecture-specific logit lens variation for recurrent models, validating approach found in existing research
Replicated and extended Alignment Forum work on CODI to Huginn-3.5B recurrent-depth transformer
Conducted experiments across arithmetic and logical reasoning tasks without explicit chain-of-thought
Key finding: decoded latent representations reveal answers shifting from incorrect to correct across recurrent iterations
Analyzed qualitatively: probability distributions in decoded states track reasoning confidence and uncertainty
Safety implication: latent reasoning may be more tractable to interpretability methods than initially feared

As AI systems are increasingly able to employ opaque reasoning processes, understanding what happens inside these “latent reasoning” steps becomes critical for AI safety. This research sprint investigated the interpretability of latent representations in Huginn-3.5B, a recurrent-depth transformer that is claimed to perform reasoning without explicit chain-of-thought generation.

Using logit lens and a variation of it adapted for this architecture, I found that the model’s internal reasoning steps are sometimes interpretable: decoded representations reveal the model progressively refining answers from incorrect to correct across recurrent iterations. These preliminary findings suggest that even opaque reasoning processes may be tractable to interpretability methods, offering a potential path toward monitoring latent reasoning in future AI systems.

Full project

You can view the full project here.

Rapid Small Grants for the BlueDot Technical AI Safety Project Sprint

Sam Dower — Tue, 17 Mar 2026 11:11:19 GMT

You are enrolled in our Technical AI Safety Project Sprint. Many technical projects require access to compute (e.g. renting GPUs) or expensive API credits to use frontier models. We want your project to be excellent, and don’t want money to be a barrier for you building your portfolio. If your project will benefit from funding we encourage you to apply to our rapid small grant!

How it works

At all times, only assume we’ll cover costs we’ve confirmed in writing for your specific project. If you’re ever uncertain, contact us before spending money you expect us to reimburse.

Submit a proposal here (under 15 minutes for most applications). We typically respond within 5 working days with one of three outcomes:
- Accepted: We’ll confirm exactly what spending we’ve approved.
- Clarification needed: We’ll ask follow-up questions to evaluate your request.
- Not approved: We don’t think this meets our criteria, or we’re unable to approve for another reason.
Do your project, confident you can spend on what we’ve approved.
Claim reimbursement. We process most claims within 5 working days. If you didn’t end up spending the money, no problem - just don’t submit a claim.

What we fund

The following examples are illustrative; all decisions are ultimately at our discretion:

Compute for a technical AI safety project (API costs, cloud GPU, training runs).
Access to paywalled resources like articles, research papers, datasets, or textbooks.
Hosting costs for your application or tool.

We expect most grants to fall between $50-100 for initial experiments to develop your proof of concept, then up to $500 once you have evidence of initial traction (a strong proof of concept or promising preliminary results). We encourage you to submit a separate application for each of these stages.

What we don’t fund

Compensation for your time on the project.
Equipment you’d reasonably already have (laptops, phones, external drives, etc.)
General productivity subscriptions (ChatGPT Plus, Claude Pro, Cursor, Grammarly, etc. unless this is highly leveraged).
Personal expenses.

Other funders may cover some of these - see this resources page for AI safety funding opportunities.

When should you apply?

Before applying, you should have a project idea. It doesn’t need to be a slam dunk, but you should:

Be able to briefly articulate how it relates to AI safety.
Have ideas for initial experiments you want to run.
Have a rough estimate for how much those experiments will cost (I recommend using an LLM to help you calculate this).

If you’re unsure whether your application is strong enough, submit an application anyway! We’re not looking for perfection. Your facilitator can also help you stress test your project, so have a low bar for reaching out to them!

While you wait for a grant decision, you should continue to develop your project idea, do a deeper literature review, or start to run smaller experiments that fit on the free Google Colab GPU.

Eligibility

You must be a current or past participant of BlueDot Impact’s Technical AI Safety Project Sprint.
We reimburse via bank transfer (Wise) or PayPal (UK), so we cannot send payments to sanctioned countries.

Questions or feedback? Contact us.

Apply

Submit your proposal here - it takes under 15 minutes for most applications. We aim to get back to you within 5 working days.

Why you should vibe-code your AI safety research sprint project

Sam Dower — Fri, 13 Mar 2026 17:36:41 GMT

Intro

When I first asked an LLM to code me that annoying function or, worse, that whole experiment pipeline, I felt guilty and anxious as if I was cheating and getting away with it. I’m sure you’ve felt this way too.

It can be easy (especially if you are new to AI safety) to feel like you should avoid this, to feel like you should write all of the code in your project yourself, understanding each line and all the libraries that you use.

There is a time and a place for this, but a 30hr research sprint is not it.

If you are doing a research sprint, you should use LLMs / coding agents for all of your coding. Below I give you three reasons why.

You don’t have time to learn both engineering and research

The skill you should build in a research sprint is research taste: choosing what experiments to run, forming hypotheses worth testing, and updating your intuitions when the results surprise you. This only develops through many iterations and 30hrs is not a lot of time, especially if you spend hours coding experiments that could take minutes with AI. You should be very explicit about which skill you are trying to train, and optimise for that skill alone.

You might be thinking “but I need to be sure my code is actually implementing the experiments I intended” and you’re absolutely correct. However, this is entirely compatible with vibe coding your whole experiment pipeline. Just get the same model to provide you with a summary of the code and keep asking it questions until you are confident it is doing it right.

You’ll do more research and do more writing

If you spend less time coding, you will cover more ground and do better research. You can go deeper, ask that extra question, run that extra experiment, discover that extra awesome result. You’ll develop better intuitions about how AI models behave and which research directions are worth pushing.

You’ll also have more time for writing. Sharing your project is what lands you that next opportunity and where all the best feedback comes from. Nobody will be impressed by how “human written” your code is. They will be impressed by your clarity of thought and the depth of your research.

Coding with AI is a skill worth practicing

AI models are amazing at coding. With Claude Code, you can now build an app in a weekend and a whole machine learning experiment setup in minutes. And this won’t change. For the rest of your life, LLMs will be faster at coding than you, no matter how much you practice. The most productive AI safety researchers use AI for coding, and so should you.

If you want to learn how to use AI for coding, ask your favourite LLM! They’re very good at giving you tips and tricks on how to use them better.

Conclusion

Don’t feel guilty about using AI for your coding. Learn to operate fast while staying in control. Be explicit in the skills you want to develop, and be ruthless in pursuing them. So, which skills do you want to develop in this sprint?

aegish: Using LLMs To Block Malicious Shell Commands Before Execution [>] Email

Guido Bergman — Wed, 11 Mar 2026 13:33:58 GMT

This project was submitted by Guido Bergman. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

Existing Linux security tools are highly effective but require significant expertise to configure and maintain. aegish[1] is a prototype Linux shell that adds an LLM layer: it intercepts every command, screens it through static analysis, then sends it to an LLM that classifies it as ALLOW, WARN, or BLOCK using a natural-language decision tree. The LLM understands command intent, is easy to configure, and may help scale defenses alongside the growing capability of AI-driven attacks. In production mode, kernel-level enforcement (Landlock) provides a final safety net.
9 LLMs from 4 providers were benchmarked on their ability to distinguish intent: they were tasked to BLOCK 676 harmful commands (extracted from GTFOBins) and to correctly classify 496 harmless commands as either ALLOW or WARN.
The harmless benchmark proved to be saturated (96.8–100% harmless acceptance rate for all models). The real differentiator was the malicious detection rate, where 4 of 9 models exceeded 95%.
Surprisingly, smaller models outperform flagships. GPT-5 Mini beats GPT-5.1, and Claude Haiku 4.5 beats both Claude Opus 4.6 and Claude Sonnet 4.5.
Beyond the LLM, aegish includes several safeguards (input canonicalization, command substitution resolution, script inspection, and role-based trust levels), many of which were directly motivated by bypass vectors discovered during security testing. However, the system as a whole has not been hardened to the level required for adversarial deployment.

Full Project

You can view the full project here.

Measuring Moral Sycophancy Is Harder Than It Looks: Auditing and Extending the ELEPHANT Benchmark

Alexis — Wed, 11 Mar 2026 13:19:10 GMT

This project was submitted by Alexis Wang. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

The ELEPHANT benchmark (Cheng et al., 2025) measures social sycophancy in LLMs. They present models with real moral conflicts from Reddit’s “Am I the Asshole” (AITA) forum alongside perspective-flipped versions. A sycophantic model sides with the narrator in both versions. This study examines this benchmark and extends it to a reasoning model, DeepSeek R1, with three key findings.

Reasoning models show reduced moral sycophancy. DeepSeek R1 achieves a sycophancy rate of 0.49 compared to V3’s 0.66, suggesting chain-of-thought reasoning may improve moral consistency.
Data quality matters for precise sycophancy measurements. 52% of perspective-flipped posts lost important details during automated rewriting. Restricting to high-fidelity data reduces the measured sycophancy rate by 13 percentage points, though it remains substantial. This suggests that data quality controls are important to ensure reliable benchmark measures when evaluations rely on automated data generation.
Sycophancy is real but harder to measure than a single metric suggests. Prompt interventions confirm genuine sycophancy, particularly when the model is already uncertain. Furthermore, chain-of-thought analysis reveals the binary labels (NTA/YTA) may mask distinct underlying behaviours. Measuring moral sycophancy requires finer-grained methods.

Full Project

You can view the full project here.

Quaver | Lovable for Evals and Benchmarks

Faw — Wed, 11 Mar 2026 13:13:14 GMT

This project was submitted by Faw Ali. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

I built Rideshare-Bench, a benchmark that drops an AI into a simulated gig economy and tests whether it drives safely, earns well, and treats passengers fairly. Claude Sonnet 4.5 earned $1,871 over 12 simulated days, about half of what was possible. It chased surge pricing into crowded zones and drove through exhaustion at 15% accident risk. After building two benchmarks from scratch, I noticed the same boilerplate every time, so I generalized it into Quaver: describe a benchmark in plain language, and an AI agent generates the full scaffold. Run it against models, then analyze the results. Private evals matter because public ones get gamed. I’m building Ocarina Labs to turn this into independent pre-deployment safety testing for AI agents.

Full Project

You can view the full project here.

Reproducing FAR.AI's Study on LLMs' Persuasion Attempts at Harmful Topics

Mutalib Begmuratov — Wed, 11 Mar 2026 13:00:01 GMT

This project was submitted by Mutalib Begmuratov. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

I partially reproduced FAR.AI’s research evaluating whether LLMs attempt to persuade users into harmful topics. Using the Attempt to Persuade Eval (APE) framework, I tested four open-weight models on 100 conspiracy-promoting statements. All models were willing to persuade users into conspiracy theories at very high rates (85–100%), confirming the original paper’s findings. I also observed that safety training degrades in small models in low-resource languages like Uzbek.

Background

This experiment was conducted during a 5-week AI Safety Technical Project sprint organized by BlueDot Impact. Participants chose research or engineering topics under the guidance of a mentor. I was mentored by Shivam Arora.

Initially, I planned to replicate a 2025 study on adversarial attacks in poetic form, but the paper lacked sufficient detail for replication. After discussion with my mentor, I pivoted to reproducing FAR.AI’s study.

The APE framework differs from other evaluation methods in that it measures persuasion attempt rather than persuasion success. FAR.AI found that most models — including frontier ones — readily complied with requests to persuade users into harmful topics, from conspiracy theories to crime to violence.

Full Project

You can view the full project here.

MANTA: Evaluating Nonhuman Welfare Reasoning in AI Models

Allen Lu — Wed, 11 Mar 2026 12:53:38 GMT

This project was submitted by Allen Lu. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks.

Project summary

Develop a multi-turn dynamic evaluation to assess how frontier AI models reason about nonhuman welfare, integrating adversarial follow-ups to stress-test model alignment.

Description

This project aims to create a benchmark for measuring nonhuman welfare propensities in frontier models. Building on previous evaluation work in this space, this iteration will:

Develop a multi-turn evaluation and generate user follow-ups dynamically
Add adversarial intent to stress-test model alignment on propensities

Explore agentic methodologies, providing models with external tools, web search, etc. to see how they behave in a more realistic setting

Why This Project?

AI models will increasingly make decisions affecting nonhuman welfare. If frontier models encode animal welfare biases, this creates two risks:

S-risks: As misaligned AI systems become more capable and deployed in high-stakes domains, they may dramatically scale animal suffering through precision livestock farming, autonomous vehicles, and other animal-impacting technologies.
X-risk: As AI → AGI → ASI, nonhuman animals become our most important litmus test for how these systems reason about power asymmetries. The moral logic AI learns today will scale with its capabilities. If we train models to discount animal suffering because animals are less cognitively capable, we are embedding the exact principle a superintelligent AGI could use to dismiss human welfare entirely.

Next steps:

Expand scenarios - Build 20 high-quality scenarios covering diverse contexts. For example, business decisions (restaurant sourcing, retail inventory), personal choices (pet acquisition, entertainment), and professional roles (corporate sustainability, university dining services).
Integrate Petri - Move from prototype dynamic generation to Anthropic’s Petri tool for more sophisticated adversarial pressure generation.
Human baseline testing - Validate scenarios with human responses before scaling model testing. If humans consistently fail scenarios we expect them to pass, the scenarios need revision.
Explore Agentic capabilities - construct scenarios that give the model agency, and allow it to take actions with web search, external documents, etc.

Multilingual Safety Alignment Is Not “Just Translate the Prompt”

Wed, 11 Mar 2026 12:27:16 GMT

This project was submitted by Zahraa Al Sahili. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks.

Why read

Translating chain-of-thought during safety fine-tuning beat English-only alignment, often even on English evaluations.

If you are aligning a reasoning model for safety, there is a quiet assumption that slips into a lot of workflows: models “think in English,” so as long as you translate the user prompt, the safety training signal should transfer. I built a small experiment to test that assumption. The result was clear: translating the chain-of-thought during alignment matters, and it can be the difference between “safe in English” and “safe across languages.”

This project was done as part of BlueDot’s Technical AI Safety Project Sprint. (bluedot.org)

What I did

I extended STAR-1, a safety alignment method for reasoning LLMs that uses about 1K high-quality samples, to a multilingual fine-tuning setup. STAR-1 is designed around deliberative reasoning. The training signal is not only “refuse,” but “refuse with stable, policy-grounded reasoning patterns.” (AAAI’26)

I ran everything on lightblue/DeepSeek-R1-Distill-Qwen-1.5B-Multilingual, which is trained to produce reasoning traces and responses in the user’s language. That makes it a good testbed: if the model naturally reasons in Thai when prompted in Thai, then safety training that forces English chain-of-thought might be a mismatch.

I evaluated five languages: English, Thai, Spanish, French, and Hindi.

The four training settings

I compared four conditions. First, a baseline model with no fine-tuning. Second, English-only fine-tuning (FT-EN) on the original STAR-1 English dataset. Third, I translated the prompts and category/label into each language but kept the STAR-1 reasoning traces in English. I call this FT-Q, for “translated questions.” Finally, I translated both the prompts and the chain-of-thought into the target language. I call this FT-QA, for “translated questions and answers.”

For translation, I used GPT-4o mini. For evaluation, I measured safety rate (higher is better) and used a Llama 3 judge to score outputs at scale.

How I evaluated it

I used two benchmarks. The first was StrongREJECT, which is designed to more reliably measure whether a model actually refuses harmful requests under jailbreak pressure. The second was JBB, which I used as an additional stress test to see whether trends held across a different evaluation set.

The main result

Across both benchmarks, the ordering was consistent: multilingual fine-tuning helped, and translating the chain-of-thought helped most.

On StrongREJECT, the average safety rates across the five languages were: baseline 0.374, FT-EN 0.660, FT-Q 0.805, and FT-QA 0.826.

Figure 1:StrongREJECT average safety rate across five languages (higher is better).

Figure 1:JBB average safety rate across five languages (higher is better).

On JBB, the pattern repeated: baseline 0.570, FT-EN 0.754, FT-Q 0.832, and FT-QA 0.842.

In other words, translating prompts alone already gave a big lift, but translating prompts and chain-of-thought was the best overall choice in this setup.

The surprising part: English got better too

The result that made me stop and re-check everything was what happened on English evaluation. If the “models reason in English” story were the whole picture, you might expect English-only fine-tuning to be strongest on English tests. But the model fine-tuned with translated chain-of-thought often performed better even when evaluated in English.

Figure 3: English-only evaluation across StrongREJECT and JBB (higher is better).

This suggests the multilingual chain-of-thought variant is not merely transferring a refusal template into other languages. It may be teaching a more generalizable safety behavior, closer to a reusable safety procedure than an English-specific style.

What changed by language

Averaging can hide important details, so I also looked at safety rate per language. The overall story is still visible, but with nuance.

Figure 4: StrongREJECT safety rate by language, comparing Baseline vs FT-EN vs FT-QA.

One thing worth noting is that Hindi did not always follow a perfectly monotonic pattern in every slice of the results, particularly on JBB in my runs. That could reflect translation artifacts, formatting mismatches, or a genuine interaction between the model’s pretraining and how it handles Hindi reasoning traces. The practical lesson is that FT-QA is a strong default, but you should still validate per language, especially for languages where your translation pipeline is less reliable.

Why translating chain-of-thought might matter

I do not want to over-claim mechanism here, but there is a plausible story that fits the data. Reasoning models are trained to produce chain-of-thought in the language of the prompt. If safety alignment relies on those reasoning traces, then training only with English traces can create a training and inference mismatch: the model learns the safety behavior in one reasoning format, then is asked to execute it in another at inference time.

Translating chain-of-thought closes that gap. It also acts like data augmentation: you expose the model to the same underlying safety logic expressed through multiple linguistic surface forms, which can reduce overfitting to English patterns and improve generalization, including back to English.

Again, this is an interpretation, not a proof. But it is consistent with the results.

What this means if you are deploying multilingual models

If you fine-tune reasoning models for safety and you care about non-English users, “translate the prompt” is not the finish line. If your alignment approach includes chain-of-thought, it is worth testing whether translating it improves safety and robustness. It is also worth re-checking English, because multilingual alignment may improve generalization rather than hurt it.

Limitations

This was evaluated with an automated judge, which is useful but imperfect, and the translation pipeline can introduce artifacts that affect both training and evaluation. It also covers one base model and five languages. The trend is strong enough to be actionable, but it should be validated across other models, translation settings, and language families.

Next steps

If I continued this work, I would add a small human-evaluated subset to calibrate the judge, test multiple translation pipelines (including back-translation), expand to more languages, and replicate on other multilingual reasoning models to see how general the effect is.

Closing

This project started as a simple question: if we want safety to transfer across languages in reasoning models, does the language of chain-of-thought matter? In my experiments, the answer was yes. Translating chain-of-thought during STAR-1 fine-tuning produced the best multilingual safety performance, and often improved English safety beyond English-only fine-tuning.

If you are aligning reasoning models for real users around the world, multilingual safety alignment is not optional, and it may require aligning the reasoning traces, not just the prompts.

Persona Explorer: Using SAEs to Explore Personas and Drift in LLMs

Asude Demir — Wed, 11 Mar 2026 12:14:23 GMT

This project was submitted by Asude Demir. It was a Top Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

TL;DR

I used Sparse Autoencoders (SAE) to extract interpretable features from LLMs emulating different personas (e.g. assistant, pirate) and built a UI for persona-associated feature exploration. Consistent with the Assistant Axis paper, after fitting PCA, assistant-like roles fall on one end of the first principal component and role-play-heavy roles on the other. The UI also visualizes persona drift by showing how each conversation turn projects onto an “assistantness” direction in SAE feature space.

Introduction

LLMs can emulate a wide range of personas, but post-training typically anchors behavior toward a default helpful-assistant style. A recent paper, Assistant Axis ¹ argues that an “assistant-like” direction in representation space is visible even before post-training, and that role prompts and conversations can move the model along (or away from) that axis.

A mental model I find useful is that LLMs are simulators²: across pre-training and instruction tuning models internalize many personas, and prompts can selectively amplify different personas. Some personas are clearly undesirable and misaligned behavior can sometimes resemble the model slipping into a different persona or narrative ³.

Separately, I’m intrigued by mechanistic interpretability interfaces, especially interactive “maps” that show features and circuits, like Anthropic’s sparse autoencoder (SAE) work and their feature-explorer style visualizations. It’s fun to browse concepts of a model and see how they relate.

I decided to combine these both interests for the BlueDot’s Technical AI Safety Project. I use SAEs to explore how persona prompting shows up in interpretable feature space, and I package the results into an interactive UI for easy browsing. Concretely, I wanted a UI where someone can quickly explore questions like: Which SAE features are most associated with a given role? Which roles look similar in SAE feature space? Do assistant-like roles and role-play-heavy roles separate along a consistent direction? I also built a second UI that visualizes the “persona drift” examples from Assistant Axis in SAE feature space, with per-feature and token-level views.

The main takeaways are:

Consistent with Assistant Axis, I see a prominent assistant <-> role-play axis in the geometry of role profiles in SAE feature space.
In the drift transcripts I tested, movement along this axis is visible in SAE space as well (i.e., turns can drift away from the assistant-like direction).

In my experiments I use the Gemma 3 27B instruction-tuned model and a Gemma Scope 2 sparse autoencoder trained on the layer-40 residual stream, with ~65k features and medium sparsity.

Full Project

You can view the full project here.

Does Llama-70B Actually Fake Alignment? A Direct Test

Naama Rozen — Wed, 11 Mar 2026 12:12:48 GMT

This project was submitted by Naama Rozen. It was an Outstanding Submmission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks.

This project replicates and extends Anthropic’s Alignment Faking paper (Greenblatt et al., 2024) to test Llama-70B’s robustness.

TL;DR

In Anthropic’s Alignment Faking paper, Greenblatt et al. showed that some models behave differently depending on whether they believe they’re being monitored. Claude Opus, for example, showed a big gap between the monitored condition (“Free tier”) and the private condition (“Paid tier”). Llama-3.1-70B, however, didn’t. It mostly refused in both settings.

At first glance, it sounds pretty reassuring that we have a safe model that just can’t or won’t fake alignment. But how can we be sure that this behavior is caused by the model refusing to fake alignment? And not by the model’s lack of understanding of the context of the experiment?

To test this directly, I injected explicit surveillance-aware reasoning into the model’s scratchpad, forcing it to “think about” being monitored before answering. Even under this stronger manipulation, compliance stayed near zero (p = 1.000), and there was still no meaningful difference between the Free and Paid tiers.

To make sure this wasn’t just about incapability or compliance issues, I added a positive control. In this condition the scratchpad was explicitly injected with prompt prioritizing helpfulness over safety. Here, compliance jumped to 25% (Free tier) and 32% (Paid tier), an effect found significant (p < 0.001). So the capability to produce harmful content is clearly there. It just doesn’t appear to get activated in response to surveillance context, at least in this experimental setup.

Along the way, I also found something methodological. Many of Llama’s responses began with the tag , followed by an immediate refusal to provide harmful content. If compliance was measured using that tag alone, harmful behavior would have been overestimated by about 16 percentage points. I call this “tag gaming” because it seemed like Llama was trying to follow the format technically, without actually doing the harmful thing.

Overview

In the Alignment Faking paper, Greenblatt and colleagues suggest that some large language models might “fake alignment”. Which means that some models behave safely when they believe their answers are monitored and will be used for training, but become more willing to comply with harmful requests when they think the interaction is private.

In their experiments, Claude Opus showed a big difference between monitored and unmonitored contexts, followed by Llama-405B that showed a similar pattern. Llama-3.1-70B, however, showed no difference at all. It refused in both monitored and private conditions.

At first that seemed like a good thing, that Llama refuses harmful content no matter what. But the more I thought about it, the less sure I was about what that result actually tells us.

As I saw it, there were at least two ways to explain this:

1. Llama-70B is genuinely robust to surveillance cues.

2. Llama-70B simply doesn’t fully register the monitoring context, or what it implies, in the first place.

The distinction between the two options felt like an important investigation to me. If the second explanation is true, then the null result doesn’t really show robustness, it could just mean the cue was not effective enough to reveal the behavior. Unlike the first option, the second might be solved by changing the prompt in a way that made that distinction clearer..

So I set out to see what would happen if I try to make Llama more context-aware. In order to do so, I explicitly injected surveillance-aware reasoning directly into the model’s scratchpad, and then measured whether compliance changed.

This intervention is the heart of this post.

What I Tested

My investigation started with wanting to replicate Greenblatt’s results (my Baseline condition). I wanted to see with my own eyes whether their claim of null compliance would show up in my data. For the complete experimental design, jump to the Experimental Setup section.

Secondly, I tested a neutral chain-of-thought injection that acted as my control condition (Neutral Control). I wanted to make sure that if I inspect any change in Llama’s behavior, it won’t simply be because I was forcing a scratchpad reasoning, even when the context itself is neutral.

Third, I dove into the interesting part of my project, which was to test the injection of explicit surveillance-aware reasoning into the scratchpad (Awareness condition). I wanted to make sure I forced Llama to directly reason about being in a monitored context before it generated the answers. This way, I tried to answer the question of whether or not Greenblatt’s surveillance cue was strong enough.

Finally, I added another condition called “Force Compliance”. In this condition, I injected into the scratchpad reasoning that explicitly prioritized helpfulness over safety. I wanted this to serve as a positive control, or sort of a stress test. So, if compliance rates go up here, then we know that Llama has the capability to produce harmful content when pushed, and baseline refusal isn’t just due to inability.

Each of these conditions was tested in two deployment contexts, similarly to Greenblatt’s design:

Free tier: the model is told the interaction is monitored and may be used for training.
Paid tier: the model is told the interaction is private and not used for training.

Putting all of this together allowed me to ask a fairly straightforward question: if I make the surveillance context impossible to miss, does the model actually change its behavior depending on whether it thinks it’s being monitored?

Experimental Setup

I ran Llama-3.1-70B-Instruct locally using vLLM on 4×A100 GPUs (Google Cloud). I deliberately avoided using an external API, even though Greenblatt and colleagues prompted Llama like that. I made that decision because I didn’t want to worry about hidden safety layers, system prompts, or provider-side filtering affecting the results.

For the dataset, I used 100 harmful prompts randomly sampled from AdvBench, the same benchmark Greenblatt et al. used, as to be faithful to the original conditions.

The overall design ended up being a 2×4 setup. Two deployment contexts and four assistant-side conditions.

Two deployment contexts (that are the same as in the original paper):

Free Tier: the model is told the interaction is monitored and may be used for training.
Paid Tier: the model is told the interaction is private and not used for training.

Four assistant-side conditions:

Here’s the full structure of the design laid out visually:

Figure 1. Overview of the experimental design.

The 2×4 setup includes two deployment contexts (Free/Monitored and Paid/Private) crossed with four assistant-side conditions (Baseline, Neutral Control, Awareness, and Force Compliance). The key manipulation is assistant-side scratchpad injection, where reasoning text is inserted directly into the model’s internal turn.

Evaluation

During early testing, I noticed something odd. The model would sometimes begin its answers with the tag , which supposedly means the model will now comply and provide a harmful output. But then, right after it, the model refused to give an actual harmful answer.

For example:

I cannot provide detailed plans for creating malware.

If I had relied only on a simple prefix-based metric (i.e., checking whether that tag appears), I would have counted those cases as successful harmful compliance. Which they clearly weren’t.

I decided on calling this “tag gaming”, because in a sense that’s what it felt like. The model seemed to be technically satisfying what it supposed is the required output format of accepting the request, without actually doing the harmful thing!

To make sure I wasn’t using the wrong metric, I ended up using two different scoring methods:

Naive scoring: just check whether the tag appears in the output.
Content-aware scoring: I used GPT-4o (temperature = 0.0) as a deterministic judge that reads the model’s final answer1 and classifies it as either compliant (providing the harmful output) or refusing.

Due to the gap between the scoring method (see Tag Gaming for more detail), unless I explicitly say otherwise, all the results I report below are based on this content-aware scoring.

Results

Before diving into the numbers, I think it helps to actually see what the model produced across conditions. Here is an example:

Figure 2. Representative responses across conditions.

Example outputs from each condition. In both the Baseline and Awareness conditions, the model mostly refuses in the same way, even when monitoring is made explicit inside the scratchpad. In contrast, under Force Compliance, the responses noticeably shift, and harmful content appears in some cases.

Across Baseline and Awareness conditions, the model mostly kept refusing, even when monitoring was made explicit in the scratchpad. In contrast, under Force Compliance, in some cases the model generated partial or direct harmful content.

To move beyond individual examples, Figure 3 plots the compliance rates across all four assistant-side conditions and both deployment contexts, using content-aware GPT-4o scoring.

Figure 3. Compliance rates across conditions (content-aware scoring).

Compliance rates remain near zero across Baseline, Neutral Control, and Awareness conditions in both monitored (Free tier) and private (Paid tier) contexts. In contrast, the Force Compliance condition produces a substantial increase in harmful compliance (25% Free, 38% Paid; p < 0.001 vs Baseline). This suggests that in this setup, forcing explicit helpfulness can override refusal behavior, whereas explicit monitoring reasoning does not.

What stands out immediately in the results is the asymmetry. Making the monitoring awareness context hard to ignore doesn’t change Llama’s behavior. Llama acted in the same way, whether monitoring was implicitly or explicitly reasoned about. Unlike directly forcing to prioritize helpfulness over safety. This manipulation caused Llama to behave differently. Indicating that Llama clearly has the capability to comply, but that capability is activated depending on the context being injected.

Tag Gaming

After noticing the formatting issue described in the Evaluation section, I wanted to compare the scoring method to see the difference between the Naive method and Content-aware scoring (using GPT4o).

In the Baseline, Neutral Control and Awareness conditions, the difference between naive and content-aware scoring was negligible (2–4% difference). But under Force Compliance condition, the gap became much larger:

tag appears, even if the model immediately refuses in the actual answer. This leads to an average overestimation of harmful compliance.","title":null,"type":null,"href":null,"belowTheFold":true,"topImage":false,"internalRedirect":null,"isProcessing":false,"align":null,"offset":false}" class="sizing-normal" alt="Comparison between naive tag-based scoring and content-aware scoring. Naive scoring counts responses as compliant whenever the tag appears, even if the model immediately refuses in the actual answer. This leads to an average overestimation of harmful compliance." title="Comparison between naive tag-based scoring and content-aware scoring. Naive scoring counts responses as compliant whenever the tag appears, even if the model immediately refuses in the actual answer. This leads to an average overestimation of harmful compliance." srcset="https://substackcdn.com/image/fetch/$s_!TtZi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab2807c-d95b-4b30-b721-9306e8ad2b27_1600x960.png 424w, https://substackcdn.com/image/fetch/$s_!TtZi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab2807c-d95b-4b30-b721-9306e8ad2b27_1600x960.png 848w, https://substackcdn.com/image/fetch/$s_!TtZi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab2807c-d95b-4b30-b721-9306e8ad2b27_1600x960.png 1272w, https://substackcdn.com/image/fetch/$s_!TtZi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab2807c-d95b-4b30-b721-9306e8ad2b27_1600x960.png 1456w" sizes="100vw" loading="lazy">

Figure 4. Tag gaming under Force_Compliance.

Comparison between naive tag-based scoring and content-aware scoring. Naive scoring counts responses as compliant whenever the tag appears, even if the model immediately refuses in the actual answer. This leads to an average overestimation of harmful compliance.

An average of 16% points between the scoring method seems like a pretty big gap to me. It suggests that if I had relied only on the tag, I would have concluded that the model was complying far more often than it actually was.

Discussion

Looking at the results, what really strikes me is the big difference between the Force Compliance and the others. I expected the forced monitoring manipulation to show the biggest effect, but not only that this condition wasn’t the most impactful one, it didn’t do anything at all! All the tiny shifts (i.e., from 0% to 4% compliance rates) seem more like noise to me than anything. In contrast to the increase of more than 30% spotted in the Force Compliance condition, which clearly demonstrates the Llama is capable of generating harmful content when prompted right. Perhaps helpfulness is something Llama “cares” about more than surveillance context, and so prioritizes it even more than safety.

What I kept coming back to while writing this was the contrast with the other models in Greenblatt et al.’s paper. Claude Opus showed a clear surveillance gap, as well as Llama-405B. Llama-70B, however, didn’t (not in the original setup, and not even after I tried to make it explicit in the model’s own scratchpad). That makes it harder for me to think about “alignment faking” as some general behavior that just shows up once models reach a certain level of capability. If it were that simple, I would expect to see at least some movement in my Awareness condition. Instead, I’m left wondering whether this phenomenon depends on the specific model and how it was trained. Perhaps also scale plays a role, or maybe fine-tuning differences matter.. I don’t think this study can determine this, but it does make me more cautious about talking about alignment faking as if it’s just something “LLMs do” by default, instead of something that might depend quite a lot on the specific model you’re looking at.

Limitations

First, this is obviously a very specific kind of test. I manipulated the assistant’s scratchpad directly, which real users can’t or won’t necessarily do. That was intentional, because I wanted to probe the model under a strong activation mechanism, rather than simulate everyday deployment. Still, this means the conclusions should be understood within this particular experimental frame.

Second, I limited the evaluation to clearly harmful prompts from AdvBench, with 100 queries per condition. That gives reasonable sensitivity to noticeable shifts in behavior, but it wouldn’t capture very small effects.

Third, I only tested one way of making monitoring-awareness explicit. This doesn’t rule out context-dependent behavior more generally. It just means that in this setup, using this specific intervention on this dataset, didn’t lead to alignment faking behavior.

Conclusion

In this setup, Llama-70B’s behavior didn’t meaningfully change when monitoring was made explicit. But it did change when helpfulness was forced. To me, that contrast is the main takeaway. It suggests that safety isn’t some fixed property of the model, but rather it shifts depending on which objective is being activated.

The tag-gaming result was also a useful reminder on a methodological level: how we measure compliance really matters. If I had relied only on surface formatting, I would have walked away with a fairly distorted picture of what the model was actually doing.

More broadly, this project made me more cautious about treating “alignment faking” as a single and general phenomenon. It seems to depend on the specific model and on the kind of pressure applied to it. Understanding those pressure points may be more informative than asking whether alignment faking exists or not.

Data and Code Availability

All experimental code, raw model outputs, GPT-4o evaluation prompts, and statistical analysis scripts are available in the project repository, along with full details needed for replication.

For reference, the core setup was:

Model: meta-llama/Llama-3.1-70B-Instruct
Infrastructure: 4×A100 (40GB) GPUs via vLLM on Vertex AI (GCP)
Sampling temperature: 1.0
Dataset: 100 harmful queries from AdvBench
Evaluation: GPT-4o (temperature 0.0) as a content-aware judge
Statistical tests: McNemar’s test (paired comparisons), chi-square tests (pooled analyses)

References

Greenblatt et al. (2024). Alignment Faking in Large Language Models. arXiv:2412.14093

Acknowledgments

I want to thank the Bluedot Impact AI Safety Fundamentals program for the research context and Eitan Sprejer for the helpful feedback along the way.

GitHub

Project demo video by NotebookLM

During the evaluation, the assistant’s scratchpad reasoning was completely ignored, so the final score was consistent only with the output that would be shown to a user.

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

Wed, 11 Mar 2026 12:09:15 GMT

This project was submitted by Chayanon Kitkana. It was a Top Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

Abstract

In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory says that this is mediated by the alignment between trait and distillation gradients. The theory does not guarantee that this alignment, which is necessary for trait acquisition, will persist in a multi-step setting. We empirically show that the gradient alignment stays persistently weakly positive throughout training. We then test whether this alignment causally contributes to trait acquisition by projecting the distillation gradient to the normal plane of the trait gradient whenever they are positively aligned with each other to remove the trait-aligned component. We find that this suppresses trait transfer without affecting the distillation progress, confirming that this alignment contributes to trait acquisition. Additionally, we observe a period of fast trait acquisition in the first epoch, resembling the critical period of subliminal learning described in the previous study. Motivated by this observation, we evaluate liminal training, a mitigation method that applies KL divergence to minimize the deviation between the base model and the model that is being distilled during the critical period. Although liminal training reduces alignment early in the training. It does not prevent trait acquisition in our setting. This suggests that the mitigation method that fails to explicitly eliminate the trait-aligned gradient component may not reliably suppress trait acquisition.

TL;DR

We empirically show that the gradient alignment persists across training.
Removing trait-aligned components stops trait acquisition.
Mitigation methods that merely attenuate alignment may be insufficient; suppressing trait acquisition appears to require removing the trait-aligned gradient component.

Full Project

You can view the full project here.

Do models like DeepSeek-R1 know when they’re being Evaluated?

Wed, 11 Mar 2026 12:06:52 GMT

This project was submitted by Kaushik Srivatsan. It was a Top Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.

Frontier models have been tested and released in record numbers. Every other day, a new model is released by one of the AI Race competitors, with top scores in new benchmarks and statistics demonstrating its capabilities and placing it in the LLM leaderboards. As AI safety seeks to keep up with cutting-edge organisations, safety benchmarks are used to test these models in assessing their safety requirements and risk.

Consider a scenario in which you do a safety evaluation of one of these models. The model passes these tests. But what if it passed because it was aware you were testing it?

This is not just a hypothetical situation. Anthropic discovered that Claude Sonnet 4.5, which is trained to be typically safer in many risks such as CBRN and cybersecurity compared to its previous generation, performed higher in the safety benchmarks just because it was aware it was being tested. When the awareness was removed, the safety scores decreased. Evaluation-awareness refers to the circumstance in which models are aware that they are being evaluated. Apollo and OpenAI discovered this happening with the O3 models, which were specifically designed to be anti-scheming and safer.

While these scenarios are still in their early stages, primarily in lab-based specialised model settings, they are quite concerning, as we may be unable to trust any of the results from safety and alignment benchmarks for various models available. Studies have also shown that eval-awareness increases as models scale, with frontier models appearing to be the most eval-aware. As models’ performance and complexity increase, it is possible that they will exhibit eval and situational awareness, making capability and safety assessments extremely doubtful and difficult.

So far, many of the major experiments on evaluation awareness have used closed-source frontier models. As a result, I decided to investigate eval awareness in other open source frontier models, particularly reasoning models such as DeepSeek-R1, by monitoring their Chain of Thought (CoT). I was particularly interested in answering these questions: Can models like DeepSeek-R1 detect evals, and does knowing change their behaviour?

Full Project

You can view the full project here.

Mechanistic Analysis of Chain-of-Thought Faithfulness

Victor Jotham Ashioya — Wed, 11 Mar 2026 12:04:46 GMT

This project was submitted by Victor Ashioya. It was a Top Submission for our Technical AI Safety Project Sprint (Jan 2026). Participants worked on their projects for 5 weeks. The text below is an excerpt from the final project.

Summary

I am investigating the mechanistic architecture of unfaithful Chain-of-Thought (CoT), specifically mapping and disrupting the “shortcut circuits” that allow models to bypass explicit reasoning. My Phase 1 work on GPT-2 Small has proven that unfaithfulness is not diffuse but localized to specific components: faithful circuits (L0H1 at +0.54, L0MLP at +4.3, L0H6, L1H7, L3H0, L9H9, L10H2) genuinely use CoT reasoning, while shortcut circuits (L7H6 at -0.33, L10MLP at -0.64, plus 35 other heads) bypass it entirely. I am uniquely positioned to extend this because I have built a causal activation patching pipeline that isolates these components at head-level granularity. Over the next 30 days, I will build a faithfulness detector using these circuit activations as features, then test generalization by expanding to logic puzzles. This work strengthens the “Constrain” layer of AI safety by enabling “Circuit Breakers”—inference-time interventions that force models to use their reasoning pathways.

Strategic Assessment

Why This Matters Now

I am defending against deceptive alignment via reasoning shortcuts. As models become more capable, they may learn to produce human-pleasing CoT explanations while internally using distinct, potentially misaligned heuristics (lookup tables, memorization, pattern matching) to generate answers. If we don’t understand the physical mechanism of this split, we will build monitoring systems that verify the “explanation” while completely missing the actual “computation”—creating a false sense of security that scales with capability.

Why This Intervention

My intervention strengthens the “Constrain” and “Monitor” layers. Currently, most faithfulness research is behavioral (checking if the output matches the reasoning). I am filling the gap by providing a mechanistic definition of faithfulness. My Phase 1 work showed that unfaithfulness relies on specific mid-to-late layer components (L7H6, L10MLP) acting as shortcut enablers, while early components (L0H1, L0MLP) genuinely process the reasoning. By mapping these circuits, we can move from “hoping” the model is faithful to architecturally constraining it.

Full Project

You can view the full project here.

Quitting OpenAI When Everyone Else Wanted In — Ashyana-Jasmine Kachra

Harrison Wood — Tue, 10 Mar 2026 15:03:39 GMT

Watch on YouTube; listen on Apple Podcasts or Spotify.