Meet AutoAgent: the Library for Autonomous AI Optimization


Every AI engineer knows the tedious loop of prompt tuning: write a system prompt, run your agent against benchmarks, analyze failures, tweak the prompt, repeat. It's grunt work dressed up as engineering. A new open-source library called AutoAgent, developed by Kevin Gu at thirdlayer.inc, offers an unsettling alternative: let an AI do it. AutoAgent is designed to autonomously improve an agent across any domain. In a 24-hour run, it achieved the #1 spot on SpreadsheetBench with a score of 96.5% and the top GPT-5 score on TerminalBench with 55.1%.

AutoAgent is described as 'like autoresearch but for agent engineering.' The idea is to give an AI agent a task and let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards changes, and repeats the process. The analogy holds because autoresearch performs similar propose-train-evaluate cycles, retaining only changes that improve validation loss. AutoAgent ports that same iterative loop from ML training into agent engineering, optimizing not model weights or training hyperparameters but the harness: the system prompt, tool definitions, and routing logic.
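The propose-evaluate-keep cycle described above can be sketched as a generic hill-climbing loop. This is an illustrative toy, not AutoAgent's actual code: the `propose`, `evaluate`, and `harness` names are stand-ins for editing agent.py and running a benchmark suite.

```python
import random

def hill_climb(evaluate, propose, harness, iterations=200, seed=0):
    """Generic propose-evaluate-keep loop: a change is kept only if it
    improves the benchmark score, otherwise it is discarded."""
    rng = random.Random(seed)
    best_score = evaluate(harness)
    for _ in range(iterations):
        candidate = propose(harness, rng)   # e.g. rewrite part of the harness
        score = evaluate(candidate)         # e.g. run the benchmark
        if score > best_score:              # keep improvements, drop regressions
            harness, best_score = candidate, score
    return harness, best_score

# Toy stand-ins: the "harness" is a single number, a "proposal" nudges it,
# and the "benchmark" rewards values near 10.
def toy_propose(h, rng):
    return h + rng.uniform(-1, 1)

def toy_evaluate(h):
    return -abs(h - 10)

best, score = hill_climb(toy_evaluate, toy_propose, harness=0.0)
print(round(best, 1))
```

The key property, shared with autoresearch-style training loops, is that regressions never survive: the harness only ever moves to states with a strictly better measured score.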

The GitHub project has a deliberately simple structure. agent.py contains the entire harness under test in a single file, including configuration, tool definitions, agent registry, and routing/orchestration logic. program.md contains the instructions for the meta-agent and the directive describing what kind of agent to build; it is the only file a human edits. The human sets the direction in program.md, while the meta-agent reads that directive, inspects agent.py, runs benchmarks, diagnoses failures, rewrites the relevant parts of agent.py, and repeats the process. The human never interacts with agent.py directly.
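Since program.md is plain Markdown, a minimal directive might look something like this (hypothetical contents for illustration, not taken from the repository):

```markdown
# Directive

Build a spreadsheet-editing agent that maximizes the SpreadsheetBench score.

## Constraints
- Keep the entire harness in agent.py.
- Prefer small, testable changes; run the benchmark after each edit.
- Log every experiment to results.tsv before starting the next one.
```

Everything below the directive is up to the meta-agent; the human only revises this file to change direction between runs.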

A critical piece of infrastructure that keeps the loop coherent across iterations is results.tsv, an experiment log automatically created and maintained by the meta-agent. It tracks every experiment run, giving the meta-agent a history to learn from and calibrate what to try next. The full project structure also includes Dockerfile.base, an optional .agent/ directory for reusable artifacts, a tasks/ folder for benchmark payloads, and a jobs/ directory for Harbor job outputs. The metric is the total score produced by the benchmark's task test suites. The meta-agent hill-climbs on this score, retaining improvements and discarding regressions.
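An experiment log like this is straightforward to maintain as tab-separated rows. The column names and file layout below are assumptions for illustration; the real results.tsv schema may differ.

```python
import csv
from pathlib import Path

LOG = Path("results.tsv")  # hypothetical location and columns
FIELDS = ["run_id", "change_summary", "score", "kept"]

def log_experiment(run_id, change_summary, score, kept):
    """Append one experiment row so later iterations can learn from history."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        if new_file:
            writer.writeheader()
        writer.writerow({"run_id": run_id, "change_summary": change_summary,
                         "score": score, "kept": kept})

def best_so_far():
    """Read the log back to calibrate what to try next."""
    with LOG.open() as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return max((float(r["score"]) for r in rows), default=None)

log_experiment(1, "tighten system prompt", 81.2, True)
log_experiment(2, "add retry tool", 79.4, False)
print(best_so_far())  # → 81.2
```

Keeping the log append-only means every run, kept or discarded, remains visible to future iterations, which is what lets the meta-agent avoid re-trying failed ideas.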

Benchmarks are expressed as tasks through AutoAgent's Harbor integration. Each task includes configuration, instructions, and tests that write scores to logs, and the meta-agent uses these results for optimization. Notably, instead of only checking answers deterministically, a test suite can use another AI to evaluate whether the agent's output is 'correct enough.' This is common in benchmarks where correct answers are not reducible to string matching.
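The LLM-as-judge pattern mentioned above can be sketched generically. This is not Harbor's actual test API; `ask_model` is a hypothetical callable wrapping whatever model endpoint the test suite uses, and the PASS/FAIL protocol is one common convention.

```python
def llm_judge_test(agent_output: str, rubric: str, ask_model) -> bool:
    """Instead of string-matching the answer, ask a second model
    whether the output satisfies a grading rubric."""
    prompt = (
        "You are grading an agent's output.\n"
        f"Rubric: {rubric}\n"
        f"Output: {agent_output}\n"
        "Reply with exactly PASS or FAIL."
    )
    verdict = ask_model(prompt).strip().upper()
    return verdict == "PASS"

# Stubbed model for illustration: passes any output mentioning a total.
def fake_model(prompt):
    output_line = [l for l in prompt.splitlines() if l.startswith("Output:")][0]
    return "PASS" if "total" in output_line.lower() else "FAIL"

print(llm_judge_test("Added a Total row summing column B.",
                     "Column B must be summed correctly.",
                     fake_model))  # → True
```

A real judge would be a second model call rather than a stub, but the shape is the same: the test emits a score to the logs, and the meta-agent hill-climbs on it like any other metric.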

AutoAgent demonstrates that autonomous harness engineering works: the meta-agent can replace the human prompt-tuning loop entirely, iterating on agent.py overnight without anyone touching the harness files directly. The benchmark results back this up. In a 24-hour run, AutoAgent achieved top scores on both SpreadsheetBench and TerminalBench, outperforming hand-engineered entries. The human's role shifts from engineer to director: you don't write or edit agent.py; you write program.md, a plain Markdown directive that steers the meta-agent. This mirrors the broader shift in agent engineering from writing code to setting goals, and it makes AutoAgent effectively plug-and-play for any benchmark that can be expressed as tasks.
