This project was submitted by Ivan Wiryadi. It was an Outstanding Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
Abstract
McKenzie et al. trained linear probes on language model activations to detect high-stakes interactions. We test whether these probes generalize to a language absent from training by translating the evaluation datasets into Indonesian and evaluating without retraining. On Llama 3.1 8B (layer 12), mean AUROC on out-of-distribution benchmarks drops from 0.9046 to 0.8957 (0.9% degradation, 10-seed mean). On Gemma 3 12B (layer 32), the gap is 0.8972 to 0.8873 (1.0%). We then use Sparse Autoencoders (SAEs) to identify features enriched in the probe's error cases. On Llama, differential analysis identifies features with 10–15% higher prevalence in errors than in correct predictions, loosely aligned with technical infrastructure and financial topics. On Gemma, the corresponding SAE produces no diagnostic signal. The cross-lingual degradation is small relative to the out-of-distribution gap between synthetic training data and naturalistic benchmarks (6–13%). The out-of-distribution gap is the more pressing concern for deployment.
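The two analyses in the abstract can be sketched in a few lines. This is an illustrative toy on synthetic data, not the project's actual pipeline: the probe scores, the simulated "translation shift", and the SAE feature activations below are all made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# (1) Cross-lingual evaluation: score the same frozen probe on an English
# benchmark and its translated counterpart, then compare AUROC.
def probe_scores(n, shift=0.0):
    # High-stakes (label 1) examples score higher on average; a translated
    # set is modeled here as a small loss of separation (the `shift`).
    labels = rng.integers(0, 2, n)
    scores = labels * (1.5 - shift) + rng.normal(0.0, 1.0, n)
    return labels, scores

y_en, s_en = probe_scores(2000)              # original-language benchmark
y_id, s_id = probe_scores(2000, shift=0.3)   # translated benchmark (toy shift)
auroc_en = roc_auc_score(y_en, s_en)
auroc_id = roc_auc_score(y_id, s_id)
degradation = (auroc_en - auroc_id) / auroc_en  # relative AUROC drop

# (2) Differential SAE analysis: for each SAE feature, compare how often it
# is active in the probe's error cases vs. its correct predictions.
n_feats = 8
fire_rates_err = np.linspace(0.1, 0.5, n_feats)         # toy: some features
errors = rng.random((300, n_feats)) < fire_rates_err    # fire more in errors
correct = rng.random((3000, n_feats)) < 0.2
prevalence_gap = errors.mean(axis=0) - correct.mean(axis=0)
enriched = np.argsort(prevalence_gap)[::-1][:3]  # top error-enriched features
```

In the project itself, `prevalence_gap` would be computed over real SAE activations on misclassified vs. correctly classified examples, and the enriched features interpreted via their top activating texts.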
Full Project
You can view the full project here.


