Overview
Natural Language Autoencoders (NLAs) offer an unsupervised pathway for translating LLM activations into human-readable explanations and back again. The approach hinges on two LLM-driven modules that collaborate to turn internal signals into language and then test whether that language preserves the original signal.
What is an NLA?
An NLA has two core components: an activation verbalizer and an activation reconstructor. The verbalizer maps an activation to a text description; the reconstructor maps that description back to an activation, closing a reconstruction loop. The two modules are trained jointly in a reinforcement learning framework that optimizes reconstruction of residual-stream activations.
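The two-module interface can be sketched with toy stand-ins. This is a minimal illustration only: the real verbalizer and reconstructor are full LLMs, and the function names and description format below are hypothetical.

```python
import numpy as np

def verbalize(activation: np.ndarray, k: int = 2) -> str:
    """Activation verbalizer (toy stand-in): map an activation vector to a
    text description naming its k strongest dimensions."""
    top = np.argsort(-np.abs(activation))[:k]
    parts = [f"dim {i} is {activation[i]:.2f}" for i in top]
    return "; ".join(parts)

def reconstruct(description: str, dim: int) -> np.ndarray:
    """Activation reconstructor (toy stand-in): parse the description back
    into an activation vector; unmentioned dimensions default to zero."""
    recon = np.zeros(dim)
    for part in description.split("; "):
        words = part.split()
        recon[int(words[1])] = float(words[3])
    return recon

act = np.array([0.1, -2.0, 0.3, 1.5])
desc = verbalize(act)               # e.g. "dim 1 is -2.00; dim 3 is 1.50"
recon = reconstruct(desc, dim=act.size)
```

The point of the loop is that reconstruction error measures how much of the original signal the text description actually carries.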
How training works
Because the method is unsupervised, it needs no ground-truth explanations. The training signal comes entirely from the reconstructor's ability to regenerate the activation from the textual description. The reinforcement learning signal nudges the verbalizer toward descriptions that are informative yet faithful, balancing conciseness with fidelity.
- Activation verbalizer converts signals into text
- Activation reconstructor converts text back to signals
- Joint training with reinforcement learning to optimize reconstruction
- Unsupervised approach avoids dependence on labeled explanations
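The training loop above can be sketched with a REINFORCE-style policy gradient. This is a heavily simplified illustration under stated assumptions: the reward is taken to be the negative L2 reconstruction error, the "verbalizer" is a two-option softmax policy rather than an LLM, and each candidate description decodes to a fixed reconstruction. None of these specifics come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
activation = np.array([1.0, 0.0])        # target residual-stream activation
vocab = [np.array([1.0, 0.0]),           # toy "descriptions": each one
         np.array([0.0, 1.0])]           # decodes to a fixed reconstruction
logits = np.zeros(len(vocab))            # verbalizer policy parameters

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    choice = rng.choice(len(vocab), p=probs)     # sample a description
    recon = vocab[choice]                        # reconstructor output
    reward = -np.sum((activation - recon) ** 2)  # fidelity-based reward
    grad = -probs                                # grad of log-prob of choice
    grad[choice] += 1.0
    logits += 0.5 * reward * grad                # REINFORCE update
```

Descriptions that reconstruct the activation poorly receive low reward, so the policy shifts probability toward the faithful description; the same principle drives the full NLA objective at LLM scale.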
Why this matters for interpretability
Interpreting LLM activations is central to AI safety and alignment. NLAs offer a concrete mechanism to generate natural language narratives about internal activations, enabling researchers to inspect, compare, and critique explanations without external annotations. The framework emphasizes fidelity of reconstruction as a core objective, which can help reveal where explanations succeed or fail to capture the underlying signal.
Potential questions and next steps
As with any new interpretability technique, several questions remain. How robust are NLAs across different models and tasks? Do the generated explanations introduce or reflect biases? How might the approach extend to other modalities, or capture temporal dynamics in residual streams? The authors present NLAs as a first step toward unsupervised explainability that complements, rather than replaces, existing methods.
In summary, NLAs provide a novel, unsupervised path to translating activations into natural language explanations, potentially aiding safer and more transparent AI systems while inviting further validation and refinement by the research community.