Best rated agent teams
The Agent Team Benchmark by SaaSquach.AI

8 teams

| # | Team Name | Pipeline ▼ | Extr. | Code | Reason. | Research | M-Tool | Bug Fix | Doc Rev. | RT Res. | Advers. | Agents | Hardware | Cost/Task |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Every score is produced by the same harness, the same tasks, and the same judge — in the same order, every time. Here's the design.
Every point is tied to a specific, checkable criterion. Does the research stage name GAIA? +8. Does it name SWE-bench? +8. Either it's in the output or it isn't.
There's no holistic scoring, no "overall quality" bucket. The judge can't give a vague 7/10 — it awards specific points for specific things. This is the only way to make scores mean something.
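The criterion-per-point idea above can be sketched as a rubric of concrete checks. This is an illustrative sketch, not the harness's actual rubric; the criteria, point values, and the `score` helper are assumptions for demonstration.

```python
# Hypothetical sketch of criterion-based scoring: every rubric item is a
# concrete, checkable predicate with a fixed point value -- there is no
# holistic "overall quality" score anywhere.
RUBRIC = [
    ("names GAIA", lambda out: "GAIA" in out, 8),
    ("names SWE-bench", lambda out: "SWE-bench" in out, 8),
]

def score(output: str) -> int:
    """Award points only for criteria the output verifiably satisfies."""
    return sum(points for _, check, points in RUBRIC if check(output))

print(score("Our research stage covers GAIA and SWE-bench."))  # 16
print(score("nothing relevant"))  # 0
```

Because each check is a yes/no predicate, two runs of the judge over the same output can only disagree if a predicate itself is ambiguous.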
The judge runs at temperature = 0. We tested this by submitting the same output 5 times and scoring it independently each time. Variance: ≤ 2 points across all trials.
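The repeatability test described above amounts to scoring one output several times and bounding the spread. A minimal sketch, using a deterministic stand-in for the real temperature-0 LLM judge (the `judge` function and its criteria are assumptions):

```python
# Illustrative determinism check: score the same output 5 times and verify
# the spread stays within the 2-point tolerance reported by the harness.
def judge(output: str) -> int:
    # Deterministic stand-in for the temperature-0 judge.
    return sum(8 for term in ("GAIA", "SWE-bench") if term in output)

scores = [judge("mentions GAIA and SWE-bench") for _ in range(5)]
assert max(scores) - min(scores) <= 2, f"variance too high: {scores}"
```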
The judge returns structured JSON: a numeric score and a one-sentence rationale. If JSON parsing fails, a regex fallback extracts the score. There is no case where a valid response returns a default.
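The parse-then-fallback path can be sketched as follows. The field names (`score`, `rationale`) and the fallback pattern are assumptions about the schema, not the harness's documented format:

```python
import json
import re

def extract_score(raw: str) -> int:
    """Parse the judge's structured JSON; fall back to regex on failure."""
    try:
        return int(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fallback: grab the first integer after a "score" label.
        match = re.search(r'"?score"?\s*[:=]\s*(\d+)', raw)
        if match is None:
            raise ValueError("no score found in judge response")
        return int(match.group(1))

print(extract_score('{"score": 72, "rationale": "Names both benchmarks."}'))  # 72
print(extract_score('score: 64 -- output truncated mid-stream'))  # 64
```

The fallback fires only when the JSON is malformed; a well-formed response always takes the first branch, which matches the no-silent-defaults guarantee above.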
The Pipeline Test measures team coordination. The supporting tasks measure individual capabilities. Both matter — a team that can coordinate but can't reason is still broken.
Run the benchmark on your own agent stack. Free, open source, takes ~15 minutes. Choose your platform:
Quick Start
```shell
export PIPELINESCORE_API_KEY="your_key_here"
./pipelinescore-macos --config team.yaml
```
Python script requirements: Python 3.10+ · pip install -r requirements.txt
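The `--config team.yaml` flag expects a team definition. A hypothetical sketch of what such a file might look like; every field name and value below is an illustrative assumption, not the tool's documented schema:

```yaml
# Hypothetical team.yaml -- field names are illustrative assumptions.
team_name: my-agent-team
agents:
  - name: researcher
  - name: coder
pipeline:
  order: [researcher, coder]
```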