Interpretability
AI models are often treated like black boxes: you put something in and something comes out, but it's not clear why the model gave that particular response. This makes it hard to trust that these models are safe. The model's internal state, a long list of neuron activations, is just numbers without obvious meaning. Researchers have made progress matching these activations to human-interpretable concepts using a technique called dictionary learning, which isolates patterns of activation that recur across many different contexts; each such recurring pattern is called a feature.
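As a rough illustration, dictionary learning of this kind can be implemented with a sparse autoencoder trained to reconstruct a model's internal activations from a small number of active features. The sketch below is a minimal, generic version of that idea; the dimensions, sparsity coefficient, and random training data are placeholders, not details from the actual research.

```python
# Minimal sketch of dictionary learning with a sparse autoencoder (SAE).
# All sizes, the sparsity coefficient, and the random "activations" are
# placeholders; the real work trains on activations recorded from the model.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation vector -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x):
        f = torch.relu(self.encoder(x))                # sparse, non-negative feature activations
        return self.decoder(f), f

d_model, n_features = 512, 4096                        # illustrative sizes only
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                        # strength of the sparsity penalty (assumed)

activations = torch.randn(4096, d_model)               # stand-in for recorded model activations

for step in range(100):
    recon, f = sae(activations)
    # Reconstruct each activation vector while keeping only a few features active.
    loss = ((recon - activations) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The L1 penalty pushes each activation vector to be explained by only a handful of active features, which is what makes the resulting dictionary entries candidates for human-interpretable concepts.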
A Detailed Look Inside Claude Sonnet
Researchers successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet, providing a rough conceptual map of its internal states. These features correspond to a vast range of entities such as cities, people, and scientific fields, and many are multimodal and multilingual, responding to the same concept across images and languages. There are also more abstract features that respond to things like bugs in computer code and discussions of gender bias. Features representing related concepts lie close to one another, an organization that matches human notions of similarity and may be the origin of Claude's ability to make analogies and metaphors.
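One simple way to probe that organization, sketched here under assumptions rather than taken from the paper, is to look up a feature's nearest neighbors by cosine similarity between the learned dictionary (decoder) vectors. The random dictionary, sizes, and query index below are illustrative.

```python
# Sketch: with a learned dictionary in hand, "nearby" features can be found by
# cosine similarity between their decoder (dictionary) vectors.
import torch
import torch.nn.functional as F

n_features, d_model = 4096, 512
dictionary = F.normalize(torch.randn(n_features, d_model), dim=1)  # one unit direction per feature

query = 123                              # hypothetical index of a feature of interest
sims = dictionary @ dictionary[query]    # cosine similarity of every feature to the query
neighbors = sims.topk(6).indices[1:]     # five closest features, skipping the query itself
print(neighbors)
```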
Manipulating Features and Understanding Behavior
Researchers can also manipulate these features, artificially amplifying or suppressing them and observing how Claude's responses change. For example, amplifying a "Golden Gate Bridge" feature gave Claude something of an identity crisis: asked about its physical form, it claimed to be the bridge itself, and it worked the bridge into answers to unrelated queries. They also found a feature that activates when Claude reads a scam email; artificially activating it strongly enough caused Claude to override its usual refusal and draft a scam email. This shows that the features are causally linked to the model's behavior, not merely correlated with it, and are likely a faithful part of how the model internally represents the world.
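One way such an intervention could be realized, sketched under assumptions rather than drawn from the paper's code, is to add a multiple of a feature's decoder direction to the model's internal activations during the forward pass. The feature index, scale, hook function, and stand-in tensors below are all hypothetical.

```python
# Sketch of feature steering: amplify one feature by adding a multiple of its
# decoder direction to the model's internal activations.
import torch

d_model, n_features = 512, 4096
dictionary = torch.randn(n_features, d_model)   # stand-in for learned decoder directions

feature_idx = 2024                              # hypothetical index of the feature to amplify
scale = 10.0                                    # amplification factor (assumed)

def steer(residual: torch.Tensor) -> torch.Tensor:
    """Add the chosen feature's direction to the activation at every token position."""
    return residual + scale * dictionary[feature_idx]

residual_stream = torch.randn(8, d_model)       # stand-in for middle-layer activations (8 tokens)
steered = steer(residual_stream)                # the model would then continue its forward pass from here
```

Suppressing a feature works the same way with a negative scale, or by clamping its activation to zero before decoding.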
Implications for Safety and Future Research
These discoveries could be used to make models safer, for example by monitoring AI systems for certain dangerous behaviors, steering them towards desirable outcomes, or removing certain dangerous subject matter entirely. The researchers also found features corresponding to capabilities with misuse potential, different forms of bias, and potentially problematic AI behaviors. The work has just begun, and there's much more to be done to understand the representations the model uses and how it uses them. For full details, please read their paper, [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://www.anthropic.com/research/mapping-mind-language-model).
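As a hedged sketch of the monitoring idea, one could watch a safety-relevant feature's activation on incoming text and raise a flag when it exceeds a threshold. The encoder weights, the "scam email" feature index, and the threshold below are placeholders for illustration, not values from the research.

```python
# Sketch of the monitoring idea: flag inputs on which a safety-relevant feature
# fires strongly. All specific values here are assumptions.
import torch

d_model, n_features = 512, 4096
encoder = torch.nn.Linear(d_model, n_features)  # stand-in for the trained SAE encoder

scam_feature_idx = 77                           # hypothetical index of a "scam email" feature
threshold = 5.0                                 # alert threshold (assumed)

def flags_scam(activations: torch.Tensor) -> bool:
    """Return True if the scam-email feature exceeds the alert threshold on any token."""
    features = torch.relu(encoder(activations))            # per-token feature activations
    return bool(features[:, scam_feature_idx].max() > threshold)

tokens = torch.randn(16, d_model)               # stand-in for per-token model activations
print(flags_scam(tokens))
```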
Understanding AI models deeply will help make them safer. This research marks an important milestone in that effort, but there's still much work to be done. The features found represent a small subset of all the concepts learned by the model, and finding a full set of features using current techniques would be cost-prohibitive. Researchers need to continue working to understand how the model uses these representations and to show that the safety-relevant features can actually be used to improve safety.