This project was submitted by Victor Ashioya. It was a Top Submission for our Technical AI Safety Project Sprint (Jan 2026). Participants worked on their projects for 5 weeks. The text below is an excerpt from the final project.
Summary
I am investigating the mechanistic architecture of unfaithful Chain-of-Thought (CoT), specifically mapping and disrupting the “shortcut circuits” that allow models to bypass explicit reasoning. My Phase 1 work on GPT-2 Small showed that unfaithfulness is not diffuse but localized to specific components: faithful circuits (L0H1 at +0.54, L0MLP at +4.3, L0H6, L1H7, L3H0, L9H9, L10H2) genuinely use CoT reasoning, while shortcut circuits (L7H6 at -0.33, L10MLP at -0.64, plus 35 other heads) bypass it entirely. I am well positioned to extend this because I have built a causal activation patching pipeline that isolates these components at head-level granularity. Over the next 30 days, I will build a faithfulness detector using these circuit activations as features, then test generalization by expanding to logic puzzles. This work strengthens the “Constrain” layer of AI safety by enabling “Circuit Breakers”: inference-time interventions that force models to use their reasoning pathways.
Strategic Assessment
Why This Matters Now
I am defending against deceptive alignment via reasoning shortcuts. As models become more capable, they may learn to produce human-pleasing CoT explanations while internally using distinct, potentially misaligned heuristics (lookup tables, memorization, pattern matching) to generate answers. If we don’t understand the physical mechanism of this split, we will build monitoring systems that verify the “explanation” while completely missing the actual “computation”—creating a false sense of security that scales with capability.
Why This Intervention
My intervention strengthens the “Constrain” and “Monitor” layers. Currently, most faithfulness research is behavioral (checking if the output matches the reasoning). I am filling the gap by providing a mechanistic definition of faithfulness. My Phase 1 work showed that unfaithfulness relies on specific mid-to-late layer components (L7H6, L10MLP) acting as shortcut enablers, while early components (L0H1, L0MLP) genuinely process the reasoning. By mapping these circuits, we can move from “hoping” the model is faithful to architecturally constraining it.
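A faithfulness detector built on these circuit activations could start as a simple linear probe. The sketch below is illustrative only: it trains logistic regression from scratch on synthetic features, where each feature stands in for one component's activation (e.g. L0H1, L0MLP, L7H6, L10MLP) and the labels stand in for faithful/unfaithful judgments. The hand-set weights encoding "early components positive, shortcut components negative" are an assumption used to generate the toy data, not a measured result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: one feature per circuit component, one row per
# prompt. Real features would be those components' activations from
# the patching pipeline, with behavioral faithfulness labels.
n_samples, n_features = 400, 8
X = rng.normal(size=(n_samples, n_features))
# Assumed generating weights: faithful components (first two) push
# toward "faithful", shortcut components (middle two) push away.
true_w = np.array([1.5, 1.0, 0.0, 0.0, -1.2, -0.8, 0.0, 0.0])
y = (X @ true_w + 0.3 * rng.normal(size=n_samples) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain gradient descent on the logistic loss (a linear probe).
w, b, lr = np.zeros(n_features), 0.0, 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / n_samples
    b -= lr * float((p - y).mean())

acc = float(((sigmoid(X @ w + b) > 0.5) == y.astype(bool)).mean())
print(f"train accuracy: {acc:.2f}")
```

The appeal of a probe this simple is interpretability: the learned weights on each component are directly inspectable, so a detector that leans on L7H6 or L10MLP activations would itself be mechanistic evidence, not just a black-box score.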
Full Project
You can view the full project here.