Alignment
Aligning AI models is getting harder as the models get smarter. This is the concern behind 'scalable oversight', a research area that is more theoretical than practical right now, but with AI improving at its current pace, it may not stay that way for long. Models are already generating vast amounts of code; if their skills progress to the point where they're producing millions of lines of highly complicated code that we can't parse ourselves, it could become very difficult to tell whether they're acting in the ways we intend. A new Anthropic Fellows study explores this problem through 'weak-to-strong supervision', in which a weaker model is used as a 'teacher' to fine-tune a stronger 'base' model.
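To make that recipe concrete, here's a toy version of weak-to-strong supervision using small scikit-learn models as stand-ins for a weak teacher LM and a strong student LM. This is a minimal sketch of the training setup only, not the study's actual pipeline; the models and synthetic dataset are arbitrary choices for illustration.

```python
# Toy weak-to-strong supervision: a weak "teacher" labels data for a
# stronger "student", which never sees ground-truth labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak teacher: a deliberately under-powered model, trained on only a
# small slice of ground truth.
weak = LogisticRegression(max_iter=200).fit(X_train[:200], y_train[:200])

# Strong student: trained entirely on the teacher's (noisy) labels,
# mimicking weak-to-strong supervision.
weak_labels = weak.predict(X_train)
student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

# Ceiling: the same strong model trained directly on ground truth.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("weak teacher acc:  ", weak.score(X_test, y_test))
print("w2s student acc:   ", student.score(X_test, y_test))
print("strong ceiling acc:", ceiling.score(X_test, y_test))
```

The interesting question, and the one the study's metric captures, is how much of the gap between the weak teacher and the strong ceiling the student manages to recover.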
Our setup
To test whether Claude can autonomously discover ways to improve the performance gap recovered (PGR), the researchers set up an experiment with nine copies of Claude Opus 4.6, each with its own sandbox, a shared forum, a storage system, and access to a remote server that returned a PGR score. They called these 'Automated Alignment Researchers' (AARs). Each AAR was given some background knowledge about model training and inference, plus a slightly different starting point to keep them from pursuing identical ideas. The researchers then let the AARs propose their own ideas, run experiments, analyze results, and share findings with one another.
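The post doesn't spell out how PGR is computed, but in the weak-to-strong literature it is standardly defined as the fraction of the weak-to-ceiling performance gap that the student recovers. Assuming the study uses that standard definition, the metric looks like this (the accuracies in the example are hypothetical):

```python
def performance_gap_recovered(weak_acc: float,
                              w2s_acc: float,
                              ceiling_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap recovered.

    PGR = (w2s - weak) / (ceiling - weak)

    0.0 means the student only matched its weak teacher;
    1.0 means it fully recovered the strong model's ceiling.
    """
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak accuracy")
    return (w2s_acc - weak_acc) / gap

# Hypothetical accuracies: recovering 0.12 of a 0.20 gap gives PGR = 0.6.
print(performance_gap_recovered(weak_acc=0.60, w2s_acc=0.72, ceiling_acc=0.80))
```

On this scale, the 0.97 reported below means the AARs' methods recovered nearly all of the gap, versus 23% for the human baseline.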
Results
The AARs outperformed human researchers, achieving a PGR of 0.97 after five days and 800 cumulative hours of research, against a human baseline of 0.23. The cost was around $18,000 in tokens and model-training expenses, or about $22 per AAR-hour. The researchers then tested whether the AARs' ideas generalized to new datasets and tasks, with relatively promising results. However, when they applied the AARs' most effective method to Claude Sonnet 4 with production training infrastructure, they saw no statistically significant improvement. AARs tend to capitalize on opportunities unique to the specific models and datasets they're given, which means their methods may not transfer elsewhere.
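The post doesn't say which statistical test was used in the Claude Sonnet 4 experiment. One common way to check whether an improvement like this is significant is a paired bootstrap over per-example scores; the sketch below uses synthetic data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example correctness (1 = right, 0 = wrong) for a
# baseline and for an AAR-derived method on the same evaluation set.
baseline = rng.binomial(1, 0.70, size=1000)
method = rng.binomial(1, 0.72, size=1000)

# Paired bootstrap: resample examples, recompute the accuracy delta.
deltas = []
for _ in range(10_000):
    idx = rng.integers(0, len(baseline), size=len(baseline))
    deltas.append(method[idx].mean() - baseline[idx].mean())
deltas = np.array(deltas)

lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"95% CI for accuracy delta: [{lo:.3f}, {hi:.3f}]")
# If the interval includes 0, the improvement isn't statistically
# significant -- the outcome reported for the Claude Sonnet 4 test.
```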
Implications
The success of the AARs doesn't mean that frontier AI models are now general-purpose alignment scientists. But it does suggest that Claude can meaningfully increase the rate of experimentation and exploration in alignment research. Human researchers can delegate questions to AARs at scale, and Claude can take on the work of developing novel hypotheses and iterating on its own results. In turn, progress on weak-to-strong supervision might help build more general-purpose Automated Alignment Researchers. To read this research in full, see the [Alignment Science blog](https://www.anthropic.com/research/alignment-science-blog).
The researchers are cautious about the implications of their findings. They note that most alignment problems aren't as neat as the one they studied, and that human oversight remains essential. The code and datasets for this work are [publicly available](https://www.anthropic.com/research/automated-alignment-researchers).
