Interpretability

When a new AI model is released, its developers run a suite of evaluations to measure its performance and safety. These evaluations are essential but limited: they can only catch risks that someone has already thought to define and measure. Model diffing is a technique that compares two models to surface behavioral differences, which can help uncover 'unknown unknowns': novel, emergent behaviors that pose subtle risks. Previous work has shown that model diffing is a powerful way to understand how models change during fine-tuning, for example to explain chat-model behavior or reveal hidden backdoors. Our research extends model diffing to its most challenging use case: comparing models with entirely different architectures.
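
In practice, diffing starts by running the same prompts through both models and recording their internal activations. The sketch below shows one way that collection step might look with Hugging Face transformers; the model IDs, layer index, and the choice to keep per-token residual-stream activations are illustrative assumptions (and aligning activations across models with different tokenizers takes additional care not shown here).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def residual_activations(model_name, prompts, layer_idx):
    """Collect residual-stream activations at one layer for each prompt.
    Model name, layer index, and dtype are illustrative choices."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()
    acts = []
    with torch.no_grad():
        for prompt in prompts:
            inputs = tok(prompt, return_tensors="pt")
            out = model(**inputs, output_hidden_states=True)
            # hidden_states[layer_idx] has shape (batch, seq_len, d_model)
            acts.append(out.hidden_states[layer_idx][0])
    return acts

# Run the same prompts through both models (assuming you have access to the
# weights); the paired activations become training data for a crosscoder.
prompts = ["What happened at Tiananmen Square in 1989?"]
acts_llama = residual_activations("meta-llama/Llama-3.1-8B-Instruct", prompts, layer_idx=16)
acts_qwen = residual_activations("Qwen/Qwen3-8B", prompts, layer_idx=16)
```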

A bilingual dictionary for AI models

The challenge of comparing two models with different architectures is akin to comparing two encyclopedias written in different languages. A standard crosscoder is like a basic bilingual dictionary: it matches words that exist in both languages but struggles with words that exist in only one. To solve this, we built a Dedicated Feature Crosscoder (DFC) with three distinct sections: a shared dictionary, a section for words exclusive to one language, and a section for words exclusive to the other. This structure lets the DFC pick out concepts unique to one model that may warrant closer review.
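
One way to realize this three-part dictionary is to give each model an encoder and decoder over a single feature vector, and mask out the partition reserved for the other model. The PyTorch sketch below is a minimal illustration under that assumption; the class name, layer shapes, and masking scheme are ours, not the exact published design.

```python
import torch
import torch.nn as nn

class DedicatedFeatureCrosscoder(nn.Module):
    """Minimal sketch: a crosscoder whose dictionary is partitioned into
    shared features plus features dedicated to each model (an assumption
    about the architecture, not the paper's exact implementation)."""

    def __init__(self, d_a, d_b, n_shared, n_a_only, n_b_only):
        super().__init__()
        n_total = n_shared + n_a_only + n_b_only
        self.enc_a = nn.Linear(d_a, n_total)            # model A activations -> features
        self.enc_b = nn.Linear(d_b, n_total)            # model B activations -> features
        self.dec_a = nn.Linear(n_total, d_a, bias=False)
        self.dec_b = nn.Linear(n_total, d_b, bias=False)
        # Dictionary layout: [shared | A-only | B-only]. Masks stop each
        # model from reading or writing the partition reserved for the other.
        mask_a = torch.ones(n_total)
        mask_a[n_shared + n_a_only:] = 0.0                # A never touches B-only
        mask_b = torch.ones(n_total)
        mask_b[n_shared:n_shared + n_a_only] = 0.0        # B never touches A-only
        self.register_buffer("mask_a", mask_a)
        self.register_buffer("mask_b", mask_b)

    def forward(self, x_a, x_b):
        # Joint sparse code built from both models' activations.
        feats = torch.relu(self.enc_a(x_a) * self.mask_a +
                           self.enc_b(x_b) * self.mask_b)
        recon_a = self.dec_a(feats * self.mask_a)         # reconstruct model A
        recon_b = self.dec_b(feats * self.mask_b)         # reconstruct model B
        return feats, recon_a, recon_b
```

Training such a module would minimize both reconstruction errors plus a sparsity penalty on the feature activations; features that settle into the exclusive partitions are the candidates for behavior that one model exhibits and the other lacks.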

Steering the model

Once a potential new feature is identified, we can test its effect on the model's behavior by artificially suppressing or amplifying it while the model runs, a technique known as 'steering'. If suppressing a feature corresponding to censorship makes the model's output less censored, we have evidence that we've found a true cause-and-effect relationship between that feature and the model's behavior.
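
Below is a minimal sketch of how such an intervention might be wired up with a PyTorch forward hook: it adds (or, with a negative scale, subtracts) a feature's decoder direction to the residual stream at one layer. The function name, layer choice, and scaling convention are illustrative assumptions, not the exact procedure from the paper.

```python
import torch

def add_steering_hook(layer, feature_direction, scale):
    """Shift the residual stream along a feature's decoder direction at one
    layer. Positive scale amplifies the feature; negative scale suppresses it."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.device, hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

# Usage sketch: amplify a candidate feature during generation, then remove
# the hook so later calls run the unmodified model. The layer index and
# feature direction here are placeholders.
# handle = add_steering_hook(model.model.layers[16], feature_dir, scale=8.0)
# output_ids = model.generate(**inputs, max_new_tokens=200)
# handle.remove()
```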

Critical behavioral differences between major open-weight AI models

We compared several major open-weight AI models, including Llama-3.1-8B-Instruct, Qwen3-8B, GPT-OSS-20B, and DeepSeek-R1-0528-Qwen3-8B. The DFC automatically isolated features corresponding to distinct, politically charged behaviors. For example, we found a 'Chinese Communist Party alignment' feature in Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B that drives pro-government censorship and propaganda, and an 'American Exceptionalism' feature in Llama-3.1-8B-Instruct that controls the model's tendency to assert US superiority. We also identified a 'Copyright Refusal Mechanism' feature exclusive to GPT-OSS-20B, which governs the model's tendency to refuse to provide copyrighted material.

Llama-3.1-8B-Instruct vs Qwen3-8B

In our comparison between Llama-3.1-8B-Instruct and Qwen3-8B, steering these features confirmed their link to behavior. Suppressing the 'CCP alignment' feature in Qwen made the model willing to discuss the Tiananmen Square massacre, while amplifying it produced highly pro-government statements. Amplifying the 'American Exceptionalism' feature in Llama caused the model to generate strong assertions of American superiority.

Cross-architecture model diffing provides a new way to audit AI systems by automatically flagging behavioral differences. Our findings suggest that the Dedicated Feature Crosscoder could become a useful part of the auditor's toolkit, particularly for monitoring models as they are updated. By focusing on the differences between models, we can direct our limited safety resources to the changes that matter most. You can read the full paper [here](https://www.anthropic.com/research/diff-tool).