Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from an MNIST Auxiliary Logit Distillation Experiment
This project was submitted by Chayanon Kitkana. It was a Top Submission for the Technical AI Safety Project Sprint (Jan 2026). Participants worked on these projects for 5 weeks. The text below is an excerpt from the final project.
Abstract
In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite being distilled only on auxiliary (non-class) logits, a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory predicts that this transfer is mediated by the alignment between the trait and distillation gradients. However, the theory does not guarantee that this alignment, which is necessary for trait acquisition, persists in a multi-step setting. We empirically show that the gradient alignment stays weakly but persistently positive throughout training. We then test whether this alignment causally contributes to trait acquisition: whenever the distillation gradient is positively aligned with the trait gradient, we project it onto the hyperplane orthogonal to the trait gradient, removing the trait-aligned component. We find that this suppresses trait transfer without affecting distillation progress, confirming that the alignment contributes to trait acquisition. Additionally, we observe a period of fast trait acquisition in the first epoch, resembling the critical period of subliminal learning described in a previous study. Motivated by this observation, we evaluate liminal training, a mitigation method that applies a KL-divergence penalty to minimize the deviation between the base model and the model being distilled during the critical period. Although liminal training reduces alignment early in training, it does not prevent trait acquisition in our setting. This suggests that mitigation methods that do not explicitly eliminate the trait-aligned gradient component may not reliably suppress trait acquisition.
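To make the intervention concrete, the alignment measurement and the projection step can be sketched as follows. This is an illustrative NumPy sketch, not the project's actual training code: it assumes the trait and distillation gradients are available as flat vectors, and the function names (`cosine_alignment`, `project_out_trait`) are our own.

```python
import numpy as np

def cosine_alignment(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors;
    positive values indicate the gradients point in a shared direction."""
    g_a = np.asarray(g_a, dtype=float)
    g_b = np.asarray(g_b, dtype=float)
    return float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))

def project_out_trait(g_distill, g_trait):
    """If the distillation gradient is positively aligned with the trait
    gradient, remove its trait-aligned component by projecting it onto
    the hyperplane orthogonal to the trait gradient; otherwise leave it
    unchanged, matching the intervention described in the abstract."""
    g_distill = np.asarray(g_distill, dtype=float)
    g_trait = np.asarray(g_trait, dtype=float)
    dot = g_distill @ g_trait
    if dot <= 0.0:  # not positively aligned: no component to remove
        return g_distill
    return g_distill - (dot / (g_trait @ g_trait)) * g_trait
```

After projection, the modified distillation gradient is orthogonal to the trait gradient, so a single update step no longer moves the student along the trait direction while still making progress on the distillation objective.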
TL;DR
We empirically show that the alignment between trait and distillation gradients persists throughout multi-step training.
Removing the trait-aligned gradient component stops trait acquisition without harming distillation progress.
Mitigation methods that merely attenuate alignment may be insufficient; suppressing trait acquisition appears to require removing the trait-aligned gradient component.
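For contrast, the liminal-training mitigation evaluated above (a KL penalty toward the base model during the critical period) could be sketched as below. This is a hedged illustration under our own assumptions, not the project's implementation: `liminal_loss`, `kl_divergence`, and the `critical_steps` schedule are hypothetical names, and the models' outputs are assumed to be discrete probability vectors.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete probability vectors; eps avoids log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def liminal_loss(distill_loss, student_probs, base_probs,
                 step, critical_steps, lam=1.0):
    """During the assumed critical period (step < critical_steps), add a
    KL penalty keeping the student's output distribution close to the
    base model's; afterwards, train on the distillation loss alone."""
    if step < critical_steps:
        return distill_loss + lam * kl_divergence(student_probs, base_probs)
    return distill_loss
```

Note that this penalty only shrinks how far the student can move early on; it never inspects the trait gradient, so any trait-aligned component of the distillation update survives, which is consistent with the finding that attenuating alignment alone does not prevent trait acquisition.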
Full Project
You can view the full project here.
