This project was submitted by Asude Demir. It was a Top Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
TL;DR
I used sparse autoencoders (SAEs) to extract interpretable features from LLMs emulating different personas (e.g. assistant, pirate) and built a UI for exploring persona-associated features. Consistent with the Assistant Axis paper, after fitting PCA, assistant-like roles fall on one end of the first principal component and role-play-heavy roles on the other. The UI also visualizes persona drift by showing how each conversation turn projects onto an “assistantness” direction in SAE feature space.
Introduction
LLMs can emulate a wide range of personas, but post-training typically anchors behavior toward a default helpful-assistant style. A recent paper, Assistant Axis1, argues that an “assistant-like” direction in representation space is visible even before post-training, and that role prompts and conversations can move the model along (or away from) that axis.
A mental model I find useful is that LLMs are simulators2: across pre-training and instruction tuning, models internalize many personas, and prompts can selectively amplify them. Some personas are clearly undesirable, and misaligned behavior can sometimes resemble the model slipping into a different persona or narrative3.
Separately, I’m intrigued by mechanistic interpretability interfaces, especially interactive “maps” of features and circuits, like Anthropic’s sparse autoencoder work and their feature-explorer-style visualizations. It’s fun to browse a model’s concepts and see how they relate.
I decided to combine both of these interests for the BlueDot Technical AI Safety Project Sprint. I use SAEs to explore how persona prompting shows up in interpretable feature space, and I package the results into an interactive UI for easy browsing. Concretely, I wanted a UI where someone can quickly explore questions like: Which SAE features are most associated with a given role? Which roles look similar in SAE feature space? Do assistant-like roles and role-play-heavy roles separate along a consistent direction? I also built a second UI that visualizes the “persona drift” examples from Assistant Axis in SAE feature space, with per-feature and token-level views.
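To make the first two questions concrete, here is a minimal sketch of how role-associated features and role similarity can be computed from per-role SAE profiles. Everything here is synthetic: `profiles`, the role names, and the feature counts are illustrative stand-ins for the per-role mean activations the project computes from real model runs.

```python
import numpy as np

# Synthetic per-role profiles: mean activation of each SAE feature over a
# role's prompts. Role names and values are illustrative stand-ins.
rng = np.random.default_rng(0)
n_features = 1000
roles = ["assistant", "tutor", "pirate", "wizard"]
profiles = {role: rng.random(n_features) for role in roles}
mean_profile = np.mean(list(profiles.values()), axis=0)

def distinctive_features(role, k=5):
    """Indices of the k SAE features that fire most above the cross-role mean."""
    return np.argsort(profiles[role] - mean_profile)[::-1][:k]

def cosine(a, b):
    """Cosine similarity between two role profiles."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Which features are most associated with a given role?
print(distinctive_features("pirate"))
# How similar are two roles in SAE feature space?
print(round(cosine(profiles["assistant"], profiles["tutor"]), 3))
```

Ranking by difference from the cross-role mean, rather than by raw activation, filters out generic features that fire for every role.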
The main takeaways are:
Consistent with Assistant Axis, I see a prominent assistant <-> role-play axis in the geometry of role profiles in SAE feature space.
In the drift transcripts I tested, movement along this axis is visible in SAE space as well (i.e., turns can drift away from the assistant-like direction).
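A toy sketch of how such an axis can fall out of PCA on role profiles. Everything here is synthetic: the roles, the hidden `axis`, and the `weights` are assumptions planted so that a separation exists by construction; in the real profiles the separation is an empirical finding, not an input.

```python
import numpy as np

# Synthetic role profiles in SAE feature space: each role sits somewhere
# along a hidden "assistantness" direction, plus noise.
rng = np.random.default_rng(1)
roles = ["assistant", "tutor", "teacher", "pirate", "wizard", "vampire"]
n_features = 500
axis = rng.standard_normal(n_features)
axis /= np.linalg.norm(axis)
# Assistant-like roles get positive weights on the axis, role-play roles negative.
weights = np.array([2.0, 1.5, 1.2, -1.8, -1.5, -2.0])
X = weights[:, None] * axis[None, :] + 0.1 * rng.standard_normal((len(roles), n_features))

# PCA via SVD on mean-centered profiles.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]          # first principal component
proj = Xc @ pc1      # each role's position along PC1

print(dict(zip(roles, np.round(proj, 2))))
```

Projecting each conversation turn's SAE profile onto the same direction (rather than whole-role profiles) gives the per-turn drift view described above.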
In my experiments I use the Gemma 3 27B instruction-tuned model and a Gemma Scope 2 sparse autoencoder trained on the layer-40 residual stream, with ~65k features and medium sparsity.
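Gemma Scope's SAEs use a JumpReLU activation. As a toy sketch of the encoder only, with made-up shapes, random weights, and a uniform threshold (the real SAE has ~65k features and learned per-feature thresholds):

```python
import numpy as np

# Toy JumpReLU SAE encoder: residual-stream vector -> sparse feature code.
# Shapes, weights, and threshold are illustrative, not Gemma Scope's.
d_model, n_features = 8, 32
rng = np.random.default_rng(2)
W_enc = rng.standard_normal((d_model, n_features)) * 0.1
b_enc = np.zeros(n_features)
theta = np.full(n_features, 0.05)  # per-feature thresholds (learned in practice)

def encode(resid):
    """Sparse SAE feature activations for one residual-stream vector."""
    pre = resid @ W_enc + b_enc
    # JumpReLU: keep the pre-activation only where it exceeds the threshold.
    return np.where(pre > theta, pre, 0.0)

acts = encode(rng.standard_normal(d_model))
print(f"{np.count_nonzero(acts)} of {n_features} features active")
```

The thresholded activation is what makes the code sparse: most features are exactly zero on any given token, so per-role and per-turn profiles stay interpretable.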
Full Project
You can view the full project here.


