NL2RepoBench

Project Overview

NL2Repo is a benchmark designed to evaluate the performance of Large Language Models (LLMs) and coding agents on long-horizon tasks that require generating a complete, runnable code repository from scratch (0-to-1). The benchmark consists of 104 distinct tasks, each paired with its own testing environment.

Running the Code

The current setup runs OpenHands in headless batch mode. Model behavior is controlled via the config.toml file. If you need to change the model configuration, please modify config.toml before starting the run.

The system currently uses a file-to-file execution workflow and manages Docker containers via python-on-whales. At the moment, only local execution is supported.
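
As a rough illustration of how a container can be started with a workspace mount through python-on-whales, the sketch below builds the arguments for `docker.run`. The function names, the `/workspace` mount point, and the `sleep infinity` command are assumptions made for this example; this is not necessarily how the project launches OpenHands.

```python
from pathlib import Path

RUNTIME_IMAGE = "docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik"


def build_run_kwargs(workspace_dir, image=RUNTIME_IMAGE, container_path="/workspace"):
    """Build keyword arguments for python_on_whales.docker.run.

    container_path and the command are illustrative assumptions.
    """
    host_path = str(Path(workspace_dir).resolve())
    return {
        "image": image,
        "command": ["sleep", "infinity"],  # keep the container alive (assumption)
        "volumes": [(host_path, container_path)],  # (host path, container path)
        "detach": True,  # return a Container object instead of blocking
        "remove": True,  # delete the container once it stops
    }


def start_container(workspace_dir):
    """Start a detached container; requires Docker and python-on-whales."""
    # Imported lazily so the argument builder can be used without the package.
    from python_on_whales import docker

    return docker.run(**build_run_kwargs(workspace_dir))
```

Keeping the argument construction separate from the call makes the mount configuration easy to inspect before any container is started.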

Note: When running in headless mode across multiple machines, you must set up shared file management (e.g., NFS) or manually transfer files to the target machines in advance.

Prerequisites

Before starting, ensure that Docker is installed locally and that the following images are available:

docker.all-hands.dev/all-hands-ai/openhands:0.56
docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik
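
One way to confirm that both images are present before a run is to call `docker image inspect` for each of them. The helper names below are hypothetical and are not part of the repository; this is a minimal sketch that assumes the Docker CLI is on the PATH.

```python
import subprocess

REQUIRED_IMAGES = [
    "docker.all-hands.dev/all-hands-ai/openhands:0.56",
    "docker.all-hands.dev/all-hands-ai/runtime:0.56-nikolaik",
]


def image_is_present(image):
    """Return True if `docker image inspect` succeeds for the given image."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def missing_images(required, is_present=image_is_present):
    """Return the images from `required` that are not available locally."""
    return [image for image in required if not is_present(image)]
```

Calling `missing_images(REQUIRED_IMAGES)` returns an empty list when both images are available; any name it returns can then be fetched with `docker pull`.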

The runtime image can be customized. The default image is sufficient for running Python-based tasks and comes with Python 3.12 preinstalled. If you need to support other languages, you can build your own runtime image and update the corresponding configuration in openhands/openhands_app.py (line 176).

Data Layout

The test_files directory contains all repository-related task data, including:
- A .txt file specifying the number of test cases
- The repository documentation in .md format
- Two .json files used for testing
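
The layout described above can be checked with a short script before a task is run. The function name is hypothetical, and the check only counts files by extension; it makes no assumption about file names or contents.

```python
from collections import Counter
from pathlib import Path


def check_task_layout(task_dir):
    """Return a list of layout problems for one task directory (empty if none)."""
    counts = Counter(
        path.suffix.lower() for path in Path(task_dir).iterdir() if path.is_file()
    )
    problems = []
    if counts[".txt"] != 1:
        problems.append(f"expected 1 .txt file, found {counts['.txt']}")
    if counts[".md"] < 1:
        problems.append("expected a .md documentation file, found none")
    if counts[".json"] != 2:
        problems.append(f"expected 2 .json files, found {counts['.json']}")
    return problems
```

Running it over every subdirectory of test_files gives a quick report of tasks with missing or extra files.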

All Docker volume mounts used for headless execution are stored in the workspaces directory. Each task is assigned a unique UUID directory. The task-specific configuration file is copied from a template and modified accordingly (mainly to mount the workspace directory into the runtime container).
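
The workspace preparation step might look roughly like the sketch below. The function name, the `{{WORKSPACE_DIR}}` placeholder, and the choice to store the copied configuration inside the workspace directory are all assumptions made for illustration; the project's actual template format may differ.

```python
import shutil
import uuid
from pathlib import Path


def prepare_workspace(workspaces_root, template_config):
    """Create a UUID-named workspace and a task-specific copy of the template."""
    task_id = str(uuid.uuid4())
    workspace = Path(workspaces_root) / task_id
    workspace.mkdir(parents=True, exist_ok=False)

    # Copy the template next to the task data (location is an assumption).
    config_copy = workspace / Path(template_config).name
    shutil.copyfile(template_config, config_copy)

    # Point the copied config at this workspace (hypothetical placeholder).
    text = config_copy.read_text(encoding="utf-8")
    text = text.replace("{{WORKSPACE_DIR}}", str(workspace.resolve()))
    config_copy.write_text(text, encoding="utf-8")
    return task_id, workspace, config_copy
```

Because each call generates a fresh UUID, concurrent tasks never share a mount directory.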

Final results are saved in the result directory. Each task produces a single aggregated .json file, named using the task’s randomly generated UUID.

The project is launched using a config.json file. A sample configuration is shown below:

{
  "startPro": [
    {
      "moduleName": "",
      "baseUrl": "",
      "sk": "",
      "proNameList": [
        "math-verify"
      ]
    }
  ],
  "max_pool_size": 20
}
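
A configuration of this shape can be loaded and expanded into one entry per (model, task) pair. The sketch below also checks that each entry in proNameList names a subdirectory of test_files. The function name and the returned structure are hypothetical, not taken from main.py.

```python
import json
from pathlib import Path


def load_task_plan(config_path, test_files_root):
    """Return ([(moduleName, task_name), ...], max_pool_size) from config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    root = Path(test_files_root)
    plan = []
    for node in config["startPro"]:
        for task_name in node["proNameList"]:
            # Every task name must correspond to a directory under test_files.
            if not (root / task_name).is_dir():
                raise FileNotFoundError(f"no task directory named {task_name!r} in {root}")
            plan.append((node["moduleName"], task_name))
    return plan, config["max_pool_size"]
```

Failing early on an unknown task name avoids starting containers for a run that cannot complete.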

Configuration Fields

startPro: A list of task nodes.
- Each node corresponds to a single model configuration.
- proNameList: A list of task names, which must match the subdirectory names under test_files.
max_pool_size: The maximum number of concurrent threads. Once this limit is reached, additional tasks will be queued until resources become available.
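
The queuing behavior of max_pool_size matches what a bounded thread pool provides. The following is a minimal sketch using the standard library's `ThreadPoolExecutor`; the `run_with_pool` and `run_task` names are hypothetical, and the project's own scheduler may be implemented differently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_pool(task_names, run_task, max_pool_size):
    """Run run_task(name) for every task with at most max_pool_size threads.

    Tasks submitted beyond the limit wait in the executor's internal queue
    until a worker thread becomes free.
    """
    with ThreadPoolExecutor(max_workers=max_pool_size) as pool:
        futures = {name: pool.submit(run_task, name) for name in task_names}
        # result() blocks until the task finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}
```

With `max_pool_size` set to 20, the first 20 tasks start immediately and each remaining task starts as soon as an earlier one finishes.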
