Science is challenging, and so is evaluating it

Evaluating AI models for scientific tasks is tough because there's no one 'right' way to do science. Scientists approach the same problem in different ways depending on their background, resources, and research style. For instance, when investigating why some people with type 2 diabetes respond to metformin while others don't, researchers could run a genome-wide association study or sequence gut microbiomes. Benchmarks like [BixBench](https://arxiv.org/abs/2311.12924) handle this by grading models on their conclusions rather than on the methods used. However, even within a chosen research direction, individual decisions can be highly subjective and lead to different conclusions, especially with noisy biological datasets. [SciGym](https://arxiv.org/abs/2311.12924) tackles this by using simulated labs with well-defined answers, but it's unclear how closely performance there tracks real-world research.

Benchmarking models on verifiable biological tasks with BioMysteryBench

To address these challenges, the team developed BioMysteryBench, a benchmark that uses real-world bioinformatics data to test Claude's capabilities. BioMysteryBench consists of 99 questions spanning a range of bioinformatics fields, created by domain experts who derived answers from controlled, objective properties of the data. Claude is given access to canonical bioinformatics tools and databases to solve these questions. BioMysteryBench is method-agnostic: it allows diverse strategies and grades Claude only on its final answers. Every question has an objective, ground-truth answer, and some are even 'superhuman', solvable by models but not by humans.
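To make that grading setup concrete, here is a minimal sketch of what an answer-only harness could look like. The JSON fields, tolerances, and question text are illustrative assumptions, not the actual BioMysteryBench schema; the point is simply that the grader compares the final answer to ground truth and never inspects the methods used to get there.

```python
import json

# Illustrative question records (not the real BioMysteryBench schema): each
# carries a prompt and an objective ground-truth answer with a tolerance.
QUESTIONS = [
    {"id": "q001", "prompt": "How many samples cluster with the treated group?",
     "answer": 12, "tolerance": 0},
    {"id": "q002", "prompt": "What fraction of reads map to the mitochondrial genome?",
     "answer": 0.085, "tolerance": 0.005},
]

def grade(question: dict, model_answer: float) -> bool:
    """Method-agnostic grading: score only the final answer, not the workflow."""
    return abs(model_answer - question["answer"]) <= question["tolerance"]

if __name__ == "__main__":
    submissions = {"q001": 12, "q002": 0.09}  # hypothetical model outputs
    results = {q["id"]: grade(q, submissions[q["id"]]) for q in QUESTIONS}
    print(json.dumps(results, indent=2))
```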

Human baselining

The team asked up to five domain experts to answer each question from scratch; questions answered correctly by at least one expert were classified as human-solvable. Claude's performance was then compared to the experts' on both human-solvable and human-difficult tasks. Interestingly, Claude sometimes mirrored human strategies and other times took completely different routes to the answer. On human-difficult tasks, Claude Sonnet 4.6 and more capable models solved a significant fraction of problems, with Claude Mythos Preview achieving a 30% solve rate.
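As a sketch of how this split might be computed, the snippet below marks a question human-solvable if any expert answered it correctly and then compares a model's solve rate on the two splits. The attempt records and model outcomes are hypothetical placeholders, not the team's data.

```python
from collections import defaultdict

# Hypothetical expert attempts: (question_id, expert_id, answered_correctly).
attempts = [
    ("q001", "expert_a", True),
    ("q001", "expert_b", False),
    ("q002", "expert_a", False),
    ("q002", "expert_c", False),
]

by_question = defaultdict(list)
for qid, _, correct in attempts:
    by_question[qid].append(correct)

# A question is human-solvable if at least one expert got it right.
human_solvable = {qid for qid, outcomes in by_question.items() if any(outcomes)}
human_difficult = set(by_question) - human_solvable

# Hypothetical model outcomes keyed by question id.
model_correct = {"q001": True, "q002": True}

def solve_rate(question_ids):
    """Fraction of the given questions the model answered correctly."""
    if not question_ids:
        return 0.0
    return sum(model_correct.get(qid, False) for qid in question_ids) / len(question_ids)

print(f"Human-solvable solve rate:  {solve_rate(human_solvable):.0%}")
print(f"Human-difficult solve rate: {solve_rate(human_difficult):.0%}")
```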

Claude’s strategies

Claude showed solid improvement across generations on BioMysteryBench. The team also analyzed the reliability of Claude's correct answers: on human-solvable problems, Claude's correct answers were often reliable, but on human-difficult tasks many correct answers came from 'lucky' reasoning paths rather than reproducible solutions.

The team was excited to see convergent work in this space, such as [CompBioBench](https://www.biorxiv.org/content/10.1101/2024.04.25.591238v1) released by Genentech and Roche, which echoed BioMysteryBench's results. The team is eager to build longer-horizon tasks that push model research capabilities, and invites others to share their benchmarks and ideas at scienceblog@anthropic.com. If you're interested in how models perform on difficult, verifiable computational biology tasks, you can [access BioMysteryBench here](https://github.com/anthropics/biomysterybench) and visit [claude.com/lifesciences](https://www.claude.com/lifesciences) to learn more.
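For readers curious what the reliability check described above could look like in practice, here is a minimal sketch that reruns each question several times and measures how often the correct answer is reproduced. The run counts, outcomes, and the 50% threshold are illustrative assumptions, not the team's actual analysis parameters.

```python
# Hypothetical outcomes over five independent runs per question (True = correct).
repeated_runs = {
    "q001": [True, True, True, False, True],     # consistently reproduced
    "q007": [False, True, False, False, False],  # rarely reproduced
}

# Label an answer "reliable" if it is reproduced in at least half of the runs;
# otherwise treat a one-off correct run as a likely "lucky" reasoning path.
for qid, outcomes in repeated_runs.items():
    rate = sum(outcomes) / len(outcomes)
    label = "reliable" if rate >= 0.5 else "lucky"
    print(f"{qid}: correct in {rate:.0%} of runs ({label})")
```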

BioMysteryBench provides an encouraging measure of scientific capability, showing that recent generations of Claude solve the majority of human-solvable problems reliably and outperform human experts on some human-difficult tasks. As models continue to improve, they're becoming genuinely useful collaborators for bioinformatics research.