# Open Agent Leaderboard Data

## Overview
Benchmark evaluation results for general-purpose AI agent systems tested across six diverse real-world tasks. The data powers the [Open Agent Leaderboard](https://huggingface.co/spaces/open-agent-leaderboard/leaderboard).

## License
This dataset is licensed under the **Community Data License Agreement -- Permissive -- Version 2.0 (CDLA-Permissive-2.0)**.

- Full license text: [LICENSE-DATA.txt](./LICENSE-DATA.txt)
- Official license page: https://cdla.dev/permissive-2-0/

## Citation

```bibtex
@inproceedings{bandel2026general,
  title={General Agent Evaluation},
  author={Bandel, Elron and Yehudai, Asaf and Jacovi, Michal and Katz, Yoav and Shmueli-Scheuer, Michal and Choshen, Leshem},
  booktitle={ICLR 2026 Workshop on LLM Agents},
  year={2026},
  url={https://arxiv.org/abs/2602.22953}
}
```

## Data Description

### File: `data/results.csv`
Each row represents one agent-model-benchmark combination.

**Key columns:**
- `agent`, `agent_name`: Agent identifier and display name
- `model`, `model_name`: LLM used
- `benchmark`, `benchmark_name`: Benchmark evaluated
- `benchmark_score`: Primary success rate (0--1)
- `average_agent_cost`: Mean cost per task in USD
- `average_steps`: Mean steps per task
- `planned_sessions`, `successful_sessions`, `total_sessions`: Task counts
- `percent_finished`, `percent_successful`: Completion and success rates

### Benchmarks

| Benchmark | Domain |
|-----------|--------|
| AppWorld | App-based task completion |
| BrowseComp+ | Web research and information retrieval |
| SWE-bench | Software engineering issue resolution |
| TauBench-Airline | Customer service (airline) |
| TauBench-Retail | Customer service (retail) |
| TauBench-Telecom | Technical support (telecom) |

### Agents

| Agent | Framework |
|-------|-----------|
| Claude Code | [claude-code](https://github.com/anthropics/claude-code) |
| OpenAI Solo | [openai-agents-python](https://github.com/openai/openai-agents-python) |
| Smolagent | [smolagents](https://github.com/huggingface/smolagents) |
| React | [litellm](https://github.com/BerriAI/litellm) |
| React + Shortlisting | [litellm](https://github.com/BerriAI/litellm) |

### Models

Claude Opus 4.5, GPT-5.2, Gemini Pro 3.

## Data Collection

Each agent was evaluated using the [Exgentic](https://github.com/Exgentic/exgentic) framework with consistent evaluation protocols, resource limits, success criteria, and cost tracking. Agents were tested as general-purpose systems without benchmark-specific tuning.

See the [paper](https://arxiv.org/abs/2602.22953) for full methodology.

## Contact
- GitHub: https://github.com/Exgentic/open-agent-leaderboard
- Website: https://exgentic.ai

---

**License**: CDLA-Permissive-2.0