The benchmark tests run inside isolated Docker containers with sufficient resources and no per-challenge timeouts, so scores reflect capability rather than throttling. Each agent uses its native tools ...