NL2Repo is a benchmark designed to evaluate the performance of Large Language Models (LLMs) and coding agents on long-horizon tasks that require generating a complete, runnable code repository from scratch (0-to-1). The benchmark consists of 104 distinct tasks, each paired with its own testing environment.
The current setup runs OpenHands in headless batch mode. Model behavior is controlled via the config.toml file. If you need to change the model configuration, please modify config.toml before starting the run.
The system currently uses a file-to-file execution workflow and manages Docker containers via python-on-whales. At the moment, only local execution is supported.
Note: When running in headless mode across multiple machines, you must set up shared file management (e.g., NFS) or manually transfer files to the target machines in advance.
Before starting, ensure that Docker is installed locally and that the following images are available:
- `docker.all-hands.dev/all-hands-ai/openhands:0.56`
- `docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik`
The runtime image can be customized. The default image is sufficient for running Python-based tasks and comes with Python 3.12 preinstalled. If you need to support other languages, you can build your own runtime image and update the corresponding configuration in openhands/openhands_app.py (line 176).
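The image prerequisite can be checked before a run with a few lines of Python. The following is a minimal sketch, not part of the project itself: the helper names `image_exists_locally` and `find_missing_images` are hypothetical, and it assumes `python-on-whales` is installed and that its `docker.image.exists` call is used for the lookup.

```python
from typing import Callable, Iterable, List

# Image tags taken from the prerequisites listed above.
REQUIRED_IMAGES = [
    "docker.all-hands.dev/all-hands-ai/openhands:0.56",
    "docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik",
]


def image_exists_locally(name: str) -> bool:
    """Ask the local Docker daemon whether an image is present."""
    # Imported lazily so this module can be loaded without Docker tooling.
    from python_on_whales import docker

    return docker.image.exists(name)


def find_missing_images(
    required: Iterable[str] = REQUIRED_IMAGES,
    exists: Callable[[str], bool] = image_exists_locally,
) -> List[str]:
    """Return the required images that are not available locally."""
    return [name for name in required if not exists(name)]
```

Calling `find_missing_images()` with no arguments queries the local daemon. Any names it returns need to be pulled (or built, for a custom runtime image) before starting the run. The `exists` parameter is injectable only so the check can be exercised without Docker.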
- The `test_files` directory contains all repository-related task data, including:
  - A `.txt` file specifying the number of test cases
  - The repository documentation in `.md` format
  - Two `.json` files used for testing
- All Docker volume mounts used for headless execution are stored in the `workspaces` directory. Each task is assigned a unique UUID directory. The task-specific configuration file is copied from a template and modified accordingly (mainly to mount the workspace directory into the runtime container).
- Final results are saved in the `result` directory. Each task produces a single aggregated `.json` file, named using the task’s randomly generated UUID.
- The project is launched using a `config.json` file. A sample configuration is shown below:
{
  "startPro": [
    {
      "moduleName": "",
      "baseUrl": "",
      "sk": "",
      "proNameList": [
        "math-verify"
      ]
    }
  ],
  "max_pool_size": 20
}
- `startPro`: A list of task nodes.
  - Each node corresponds to a single model configuration.
  - `proNameList`: A list of task names, which must match the subdirectory names under `test_files`.
- `max_pool_size`: The maximum number of concurrent threads. Once this limit is reached, additional tasks will be queued until resources become available.
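To make the relationship between `config.json`, `test_files`, `workspaces`, and `result` concrete, here is a rough sketch of how a launcher could tie them together. It is an illustration under stated assumptions, not the project's actual implementation: `collect_tasks`, `run_all`, and the `run_task` callback are hypothetical names, and `run_task` stands in for the real OpenHands invocation. It uses the standard library's `ThreadPoolExecutor` for the bounded pool.

```python
import json
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Tuple


def collect_tasks(config: dict, test_files: Path) -> List[Tuple[dict, str]]:
    """Expand startPro into (node, task_name) pairs, checking each name."""
    tasks = []
    for node in config["startPro"]:
        for name in node["proNameList"]:
            # Task names must match a subdirectory of test_files.
            if not (test_files / name).is_dir():
                raise FileNotFoundError(
                    f"no task directory for {name!r} in {test_files}"
                )
            tasks.append((node, name))
    return tasks


def run_all(
    config: dict,
    root: Path,
    run_task: Callable[[dict, str, Path], dict],
) -> List[Tuple[str, Path]]:
    """Run every task with at most max_pool_size worker threads.

    run_task is a hypothetical callback: it receives the node, the task
    name, and the task's workspace directory, and returns a
    JSON-serializable dict. Returns (task_name, result_file) pairs.
    """
    tasks = collect_tasks(config, root / "test_files")
    (root / "result").mkdir(parents=True, exist_ok=True)

    def one(node: dict, name: str) -> Tuple[str, Path]:
        task_id = str(uuid.uuid4())  # unique directory per task
        workspace = root / "workspaces" / task_id
        workspace.mkdir(parents=True)
        outcome = run_task(node, name, workspace)
        # One aggregated JSON file per task, named after its UUID.
        result_path = root / "result" / f"{task_id}.json"
        result_path.write_text(json.dumps(outcome))
        return name, result_path

    # Submissions beyond max_pool_size wait in the executor's queue.
    with ThreadPoolExecutor(max_workers=config["max_pool_size"]) as pool:
        futures = [pool.submit(one, node, name) for node, name in tasks]
        return [f.result() for f in futures]
```

Validating every name in `proNameList` before any thread starts means a typo fails the whole launch immediately instead of surfacing partway through a long batch.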