This project was submitted by Ivan Wiryadi. It was an Outstanding Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
Abstract
McKenzie et al. trained linear probes on language model activations to detect high-stakes interactions. We test whether these probes generalize to a language absent from training by translating the evaluation datasets into Indonesian and evaluating without retraining. On Llama 3.1 8B (layer 12), mean AUROC on out-of-distribution benchmarks drops from 0.9046 to 0.8957 (0.9% degradation, 10-seed mean). On Gemma 3 12B (layer 32), the gap is 0.8972 to 0.8873 (1.0%). We then use Sparse Autoencoders (SAEs) to identify features enriched in the probe's error cases. On Llama, differential analysis identifies features with 10–15% higher prevalence in errors than in correct predictions, loosely aligned with technical infrastructure and financial topics. On Gemma, the corresponding SAE produces no diagnostic signal. The cross-lingual degradation is small relative to the out-of-distribution gap between synthetic training data and naturalistic benchmarks (6–13%). The out-of-distribution gap is the more pressing concern for deployment.
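The two analyses in the abstract can be sketched in a few lines. This is an illustrative toy on synthetic data, not the project's actual pipeline: the probe scores, the simulated "translation shift", and the SAE feature activations below are all made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# (1) Cross-lingual evaluation: score the same frozen probe on an English
# benchmark and its translated counterpart, then compare AUROC.
def probe_scores(n, shift=0.0):
    # High-stakes (label 1) examples score higher on average; a translated
    # set is modeled here as a small loss of separation (the `shift`).
    labels = rng.integers(0, 2, n)
    scores = labels * (1.5 - shift) + rng.normal(0.0, 1.0, n)
    return labels, scores

y_en, s_en = probe_scores(2000)              # original-language benchmark
y_id, s_id = probe_scores(2000, shift=0.3)   # translated benchmark (toy shift)
auroc_en = roc_auc_score(y_en, s_en)
auroc_id = roc_auc_score(y_id, s_id)
degradation = (auroc_en - auroc_id) / auroc_en  # relative AUROC drop

# (2) Differential SAE analysis: for each SAE feature, compare how often it
# is active in the probe's error cases vs. its correct predictions.
n_feats = 8
fire_rates_err = np.linspace(0.1, 0.5, n_feats)         # toy: some features
errors = rng.random((300, n_feats)) < fire_rates_err    # fire more in errors
correct = rng.random((3000, n_feats)) < 0.2
prevalence_gap = errors.mean(axis=0) - correct.mean(axis=0)
enriched = np.argsort(prevalence_gap)[::-1][:3]  # top error-enriched features
```

In the project itself, `prevalence_gap` would be computed over real SAE activations on misclassified vs. correctly classified examples, and the enriched features interpreted via their top activating texts.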
Full Project
You can view the full project here.


