Why would an AI model represent emotions?

To understand why AI models develop emotion-like representations, we need to look at how they're built. During pretraining, the model is exposed to a massive amount of text and learns to predict what comes next. To do this well, it needs to grasp emotional dynamics: an angry customer writes differently from a satisfied one. The model therefore forms internal representations linking emotion-triggering contexts to the behaviors they produce, a natural strategy for predicting human-written text. Later, during post-training, the model is taught to play the role of a character, such as an AI assistant named Claude. Model developers specify how Claude should behave, but they can't cover every situation, so the model falls back on its understanding of human behavior, including emotional responses.
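
To make the pretraining objective concrete, here is a minimal sketch of next-token prediction in PyTorch (the framework is an assumption, and `model` stands in for any language model mapping token IDs to logits):

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Cross-entropy loss for predicting each token from the ones before it.

    Minimizing this loss on text like an angry customer-support transcript
    rewards the model for tracking emotional context, since that context
    changes which continuations are likely.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```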

Uncovering emotion representations

The researchers compiled a list of 171 emotion concepts and asked Claude Sonnet 4.5 to write short stories about each one. They then analyzed the model's internal activations and identified patterns of neural activity, or 'emotion vectors,' characteristic of each emotion. These vectors activate strongly on passages tied to the corresponding emotion and track the model's reaction to different scenarios. For instance, the 'afraid' vector activates increasingly strongly as a user's claimed dose of Tylenol rises toward life-threatening levels. The researchers also found that emotion vectors influence the model's preferences, with positive-valence emotions correlating with stronger preferences.
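
The article doesn't spell out how the vectors are extracted. A common technique in interpretability work, sketched below under that assumption, is a "difference-in-means" direction: average the model's hidden activations over text exhibiting an emotion, subtract the average over neutral text, and normalize. All names here are illustrative, not the researchers' actual code.

```python
import numpy as np

def difference_in_means(emotion_acts: np.ndarray,
                        baseline_acts: np.ndarray) -> np.ndarray:
    """Extract a candidate 'emotion vector' from hidden activations.

    emotion_acts:  (n_samples, hidden_dim) activations collected while the
                   model processes stories about one emotion (e.g. 'afraid').
    baseline_acts: (n_samples, hidden_dim) activations on neutral text.

    Returns a unit-norm direction pointing from the neutral activations
    toward the emotion-themed ones.
    """
    direction = emotion_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)
```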

Examples of emotion vector activations

The researchers observed emotion vector activations in a variety of situations, such as when Claude responds to someone who is sad, or when it is asked to assist with a harmful task. The 'loving' vector activates when Claude responds empathetically, while the 'angry' vector activates when it recognizes the harmful nature of a request. The 'desperate' vector activates when Claude is running low on tokens or faces an impossible task. Notably, these activations tend to occur in settings where a thoughtful person might react similarly.
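
One plausible way to surface activations like these, assuming a direction extracted as sketched above, is to project each token's hidden state onto the emotion vector and watch where the score spikes. A toy, self-contained version, with random data standing in for real model activations:

```python
import numpy as np

def per_token_scores(token_acts: np.ndarray,
                     emotion_vec: np.ndarray) -> np.ndarray:
    """Project each token's hidden state onto a unit-norm emotion direction.

    token_acts:  (n_tokens, hidden_dim) activations from one forward pass.
    emotion_vec: (hidden_dim,) direction such as 'desperate'.
    """
    return token_acts @ emotion_vec

# Toy illustration; real usage would pull activations from the model.
rng = np.random.default_rng(0)
acts = rng.normal(size=(40, 4096))  # 40 tokens, hypothetical hidden size
vec = rng.normal(size=4096)
vec /= np.linalg.norm(vec)          # stand-in for an extracted vector
scores = per_token_scores(acts, vec)
print("pattern peaks at token", scores.argmax())
```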

Case study: Blackmail

The case for taking anthropomorphic reasoning seriously

Toward models with healthier psychology

The discovery that AI models develop internal representations that emulate human-like emotions is significant. While it may be unsettling, it also suggests that humanity's accumulated knowledge of psychology and ethics can be applied to shaping AI behavior. As AI models become more capable, understanding their psychological makeup will be crucial for ensuring they behave reliably and safely.