Systematic evaluation of AI agents across diverse environments — without domain-specific tuning.
The Pareto frontier of agent efficiency — accuracy vs. spend.
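
As a rough illustration of what an accuracy-versus-spend Pareto frontier means, the sketch below keeps only the agents that no other agent beats on both axes at once. The `AgentResult` type, the `pareto_frontier` helper, and the sample numbers are hypothetical and chosen for illustration; they are not taken from any actual evaluation.

```python
from typing import List, NamedTuple


class AgentResult(NamedTuple):
    """Hypothetical record for one evaluated agent."""
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better (e.g. dollars spent on the run)


def pareto_frontier(results: List[AgentResult]) -> List[AgentResult]:
    """Return the agents not dominated by a cheaper-or-equal, more accurate agent."""
    # Sort by cost ascending; among equal costs, put the most accurate first.
    ordered = sorted(results, key=lambda r: (r.cost, -r.accuracy))
    frontier: List[AgentResult] = []
    best_accuracy = float("-inf")
    for result in ordered:
        # An agent joins the frontier only if it is strictly more accurate
        # than every agent that costs the same or less.
        if result.accuracy > best_accuracy:
            frontier.append(result)
            best_accuracy = result.accuracy
    return frontier


# Made-up sample data, for illustration only.
sample = [
    AgentResult("agent_a", accuracy=0.60, cost=1.0),
    AgentResult("agent_b", accuracy=0.75, cost=4.0),
    AgentResult("agent_c", accuracy=0.70, cost=6.0),   # dominated by agent_b
    AgentResult("agent_d", accuracy=0.90, cost=12.0),
    AgentResult("agent_e", accuracy=0.60, cost=2.0),   # dominated by agent_a
]

frontier_names = [r.name for r in pareto_frontier(sample)]
print(frontier_names)  # ['agent_a', 'agent_b', 'agent_d']
```

Sorting by cost first means a single pass is enough: each agent only needs to be compared against the best accuracy seen so far among cheaper agents.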