The premise
Most scientists currently using AI agents work in a conversational loop, managing each step of the process closely. But as models have improved, a new way of working has emerged: specifying high-level objectives and letting a team of agents work autonomously. This makes it possible to complete projects in hours that might otherwise take days, weeks, or months. Tasks like reimplementing a numerical solver, converting legacy scientific software, and debugging large codebases are well-suited for this approach. Anthropic's C compiler project demonstrated this: Claude worked across 2,000 sessions to build a C compiler capable of compiling the Linux kernel. The goal here is to apply the same approach to scientific computing.
Draft a plan and iterate locally
When managing an autonomous research team of agents, most of the human effort goes into crafting instructions that clearly articulate the project's deliverables and context. These instructions live in a CLAUDE.md file, which Claude reads at the start of each session and can edit as it works. For the cosmological Boltzmann solver project, the high-level goals were specified up front, and the plan was iterated on with Claude until it seemed satisfactory. Getting this file right is crucial: it guides every subsequent session and lets Claude update the plan as the project evolves.
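As an illustration, a CLAUDE.md for a project like this might look as follows (the contents here are invented for illustration, not the file actually used):

```markdown
# Project: cosmological Boltzmann solver

## Goal
Implement a linear Boltzmann solver from scratch and match the
reference CLASS implementation to sub-percent accuracy.

## Working rules
- Read CHANGELOG.md first; never re-attempt approaches recorded as failed.
- Validate each module against reference outputs before moving on.
- Commit and push after every meaningful unit of work.
- Update this plan if a better decomposition of the work emerges.
```

The working rules matter as much as the goal: they are what each fresh session falls back on when context is lost.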
Memory across sessions
The progress file, CHANGELOG.md, acts as the agent's long-term memory, tracking current status, completed tasks, failed approaches, and accuracy tables. This file is essential to prevent successive sessions from re-attempting dead ends. Claude was instructed to keep track of progress in this file, providing a clear picture of its work. A good progress file includes details like 'Tried using Tsit5 for the perturbation ODE, system is too stiff. Switched to Kvaerno5.'
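A sketch of such a progress file, with entries invented for illustration:

```markdown
# CHANGELOG

## Current status
Background module matches reference to 0.05%; perturbation module in progress.

## Failed approaches (do not retry)
- Tried using Tsit5 for the perturbation ODE, system is too stiff.
  Switched to Kvaerno5.

## Accuracy vs reference
| Quantity | Max rel. error |
|----------|----------------|
| H(z)     | 5e-4           |
| P(k)     | 2e-2           |
```

Because each session starts with a fresh context window, anything not written here is effectively forgotten.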
The test oracle
A clear, automatic way to measure progress is what makes autonomous work possible: without it, the agent cannot tell whether it is getting closer to the goal. For scientific code, the oracle can be a reference implementation, such as the CLASS C source, or some other clearly quantifiable objective. Claude was instructed to construct and run unit tests against the reference implementation to ensure accuracy. This prevents regressions and keeps the agent on track.
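As a minimal sketch of this pattern, here is a tolerance-based test in which a toy solver is checked against an analytic reference; the solver and reference here are stand-ins for illustration, not the actual Boltzmann code or CLASS comparison:

```python
import math

def decay_solver(y0, rate, t, steps=1000):
    """Toy forward-Euler integrator for dy/dt = -rate * y,
    standing in for the solver under test."""
    dt = t / steps
    y = y0
    for _ in range(steps):
        y += dt * (-rate * y)
    return y

def reference(y0, rate, t):
    """Analytic solution, playing the role of the reference
    implementation (e.g. tabulated CLASS output)."""
    return y0 * math.exp(-rate * t)

def test_solver_matches_reference():
    # Sub-percent agreement target, mirroring the project's accuracy goal.
    for t in (0.1, 1.0, 5.0):
        got = decay_solver(1.0, 0.5, t)
        want = reference(1.0, 0.5, t)
        assert abs(got - want) / abs(want) < 1e-2, (t, got, want)

test_solver_matches_reference()
print("ok")
```

Run under pytest, a suite of tests like this gives the agent an unambiguous pass/fail signal after every change.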
Git as coordination
Git can be used to monitor and coordinate the agent's work hands-off. The agent should commit and push after every meaningful unit of work, which yields a recoverable history and makes progress visible without attaching to the session. Instructions in CLAUDE.md can enforce this, e.g. 'Commit and push after every meaningful unit of work. Run `pytest tests/ -x -q` before every commit.' This ensures that work is never lost and that the commit log doubles as a progress report.
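One lightweight way to back up the test-before-commit rule mechanically is a git pre-commit hook; this is a sketch of one possible setup, not part of the original workflow:

```shell
#!/bin/sh
# Hypothetical .git/hooks/pre-commit: refuse any commit whose tests fail.
pytest tests/ -x -q || {
    echo "Tests failed; commit aborted." >&2
    exit 1
}
```

The hook is a belt-and-suspenders measure: even if a session forgets the CLAUDE.md instruction, broken code cannot enter the history.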
The execution loop
The execution loop: start a Claude Code session inside a terminal multiplexer like tmux on a compute node, tell the agent where to find the codebase, and let it work. The session can be detached and progress checked occasionally. On an HPC cluster, a job script can launch Claude Code inside a tmux session. For long-running tasks, the Ralph loop is a useful orchestration pattern: whenever the agent claims completion, it is kicked back into context and asked whether it is really done, so it keeps working until the task is genuinely complete.
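A minimal sketch of the Ralph loop, assuming the `claude` command-line tool is on PATH; the prompt text and file names are invented for illustration:

```shell
#!/bin/bash
# Hypothetical Ralph loop: restart the agent until it proves completion.
while true; do
    claude -p "Read CLAUDE.md and CHANGELOG.md, then continue the project.
If every goal is met and all tests pass, print exactly DONE." \
        > last_run.log 2>&1
    grep -qx "DONE" last_run.log && break
done
```

Each iteration starts a fresh session that re-reads the plan and progress files, so the loop terminates only when the agent can truthfully report completion against its own tests.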
The result
Claude built the cosmological Boltzmann solver from scratch over a few days, reaching sub-percent agreement with the reference CLASS implementation. Its development trajectory was somewhat clunky, with gaps in test coverage and occasional elementary mistakes, but it made sustained progress toward the stated goal. The resulting solver is not production-grade, yet it demonstrates that agent-driven development can compress months of researcher work into days. This changes what counts as idle time: not running agents can mean potential progress left on the table.
Acknowledgments
The authors thank Eric Kauderer-Abrams for peer review, as well as Xander Balwit, Ethan Dyer, and Rebecca Hiscott for providing helpful feedback on the project.
The potential for AI agents to accelerate scientific research is vast. By working autonomously, agents like Claude can complete complex tasks in a fraction of the time it would take humans. While there are still challenges to overcome, the results are promising, and the possibilities are exciting. As compute and projects with well-defined success criteria become more available, the opportunity cost of not running agents will only grow.