Best rated agent teams
The Agent Team Benchmark by SaaSquach.AI

8 teams

| # | Team Name | Pipeline ▼ | Extr. | Code | Reason. | Research | M-Tool | Bug Fix | Doc Rev. | RT Res. | Advers. | Agents | Hardware | Cost/Task |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Every score is produced by the same harness, the same tasks, and the same judge — in the same order, every time. Here's the design.
Every point is tied to a specific, checkable criterion. Does the research stage name GAIA? +8. Does it name SWE-bench? +8. Either it's in the output or it isn't.
There's no holistic scoring, no "overall quality" bucket. The judge can't give a vague 7/10 — it awards specific points for specific things. This is the only way to make scores mean something.
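The criterion-per-point idea above can be sketched as a rubric of concrete checks. This is an illustrative sketch, not the harness's actual rubric; the criteria, point values, and the `score` helper are assumptions for demonstration.

```python
# Hypothetical sketch of criterion-based scoring: every rubric item is a
# concrete, checkable predicate with a fixed point value -- there is no
# holistic "overall quality" score anywhere.
RUBRIC = [
    ("names GAIA", lambda out: "GAIA" in out, 8),
    ("names SWE-bench", lambda out: "SWE-bench" in out, 8),
]

def score(output: str) -> int:
    """Award points only for criteria the output verifiably satisfies."""
    return sum(points for _, check, points in RUBRIC if check(output))

print(score("Our research stage covers GAIA and SWE-bench."))  # 16
print(score("nothing relevant"))  # 0
```

Because each check is a yes/no predicate, two runs of the judge over the same output can only disagree if a predicate itself is ambiguous.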
The judge runs at temperature = 0. We tested this by submitting the same output 5 times and scoring it independently each time. Variance: ≤ 2 points across all trials.
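The repeatability test described above amounts to scoring one output several times and bounding the spread. A minimal sketch, using a deterministic stand-in for the real temperature-0 LLM judge (the `judge` function and its criteria are assumptions):

```python
# Illustrative determinism check: score the same output 5 times and verify
# the spread stays within the 2-point tolerance reported by the harness.
def judge(output: str) -> int:
    # Deterministic stand-in for the temperature-0 judge.
    return sum(8 for term in ("GAIA", "SWE-bench") if term in output)

scores = [judge("mentions GAIA and SWE-bench") for _ in range(5)]
assert max(scores) - min(scores) <= 2, f"variance too high: {scores}"
```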
The judge returns structured JSON: a numeric score and a one-sentence rationale. If JSON parsing fails, a regex fallback extracts the score. There is no case where a valid response returns a default.
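The parse-then-fallback path can be sketched as follows. The field names (`score`, `rationale`) and the fallback pattern are assumptions about the schema, not the harness's documented format:

```python
import json
import re

def extract_score(raw: str) -> int:
    """Parse the judge's structured JSON; fall back to regex on failure."""
    try:
        return int(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fallback: grab the first integer after a "score" label.
        match = re.search(r'"?score"?\s*[:=]\s*(\d+)', raw)
        if match is None:
            raise ValueError("no score found in judge response")
        return int(match.group(1))

print(extract_score('{"score": 72, "rationale": "Names both benchmarks."}'))  # 72
print(extract_score('score: 64 -- output truncated mid-stream'))  # 64
```

The fallback fires only when the JSON is malformed; a well-formed response always takes the first branch, which matches the no-silent-defaults guarantee above.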
The Pipeline Test measures team coordination. The supporting tasks measure individual capabilities. Both matter — a team that can coordinate but can't reason is still broken.
Run the benchmark on your own agent stack. Free, open source, takes ~15 minutes. Choose your platform:
Quick Start
```shell
export PIPELINESCORE_API_KEY="your_key_here"
./pipelinescore-macos --config team.yaml
```
Python script requirements: Python 3.10+ · pip install -r requirements.txt
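The `--config team.yaml` flag expects a team definition. A hypothetical sketch of what such a file might look like; every field name and value below is an illustrative assumption, not the tool's documented schema:

```yaml
# Hypothetical team.yaml -- field names are illustrative assumptions.
team_name: my-agent-team
agents:
  - name: researcher
  - name: coder
pipeline:
  order: [researcher, coder]
```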