This project was submitted by Tristan von Busch. It was a Top Submission for the “Technical AI Safety Project” Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
TL;DR
- Independently reproduced architecture-specific logit-lens variation for recurrent models, validating the approach found in existing research
- Replicated and extended Alignment Forum work on CODI, applying it to Huginn-3.5B, a recurrent-depth transformer
- Ran experiments across arithmetic and logical-reasoning tasks without explicit chain-of-thought
- Key finding: decoded latent representations show answers shifting from incorrect to correct across recurrent iterations
- Qualitative analysis: probability distributions in decoded states track the model's reasoning confidence and uncertainty
- Safety implication: latent reasoning may be more tractable to interpretability methods than initially feared
As AI systems are increasingly able to employ opaque reasoning processes, understanding what happens inside these “latent reasoning” steps becomes critical for AI safety. This research sprint investigated the interpretability of latent representations in Huginn-3.5B, a recurrent-depth transformer that is claimed to perform reasoning without explicit chain-of-thought generation.
Using the logit lens and a variation of it adapted to this architecture, I found that the model's internal reasoning steps are sometimes interpretable: decoded representations show the model progressively refining answers from incorrect to correct across recurrent iterations. These preliminary findings suggest that even opaque reasoning processes may be tractable to interpretability methods, offering a potential path toward monitoring latent reasoning in future AI systems.
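The core decoding step described above can be sketched in a minimal, self-contained form. The sketch below is illustrative only: it uses toy random weights rather than Huginn-3.5B's actual parameters, and the shapes, the `rms_norm` final norm, and the `logit_lens` helper are all assumptions standing in for the real model's normalization and unembedding. The idea is simply to project each latent state from a recurrent iteration through the (final norm + unembedding) map and read off a probability distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real model's components (hypothetical sizes,
# not Huginn-3.5B's actual dimensions or API):
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix
gamma = np.ones(d_model)                 # final-norm scale (RMSNorm-style, assumed)

def rms_norm(h, g, eps=1e-6):
    """RMS-normalize a hidden state, as many modern transformers do before unembedding."""
    return g * h / np.sqrt(np.mean(h * h) + eps)

def logit_lens(h):
    """Decode a latent state: final norm, unembed, then softmax over the vocabulary."""
    logits = rms_norm(h, gamma) @ W_U
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Pretend latent states taken from successive recurrent iterations:
latents = [rng.normal(size=d_model) for _ in range(4)]
for step, h in enumerate(latents):
    probs = logit_lens(h)
    top = int(probs.argmax())
    print(f"iteration {step}: top token id={top}, p={probs[top]:.3f}")
```

In the actual experiments one would replace the toy weights with the model's own unembedding and final norm, and the per-iteration top token and its probability give the trajectory of the answer (and the model's confidence in it) across recurrent depth.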
Full project
You can view the full project here.
