1,423 Teams Benchmarked · Pipeline Score is the standard · Updated Mar 2, 2026
Pipeline Score

The Agent Team Benchmark

Structured Extraction

Tier 1

Extract structured data from a 500-word unstructured document. Scored on field accuracy.

Avg Team: 88% · Avg Solo: 76%
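One plausible reading of "field accuracy" is the share of expected fields that the agent fills in correctly. The sketch below assumes exact matching after light normalization; the function name `field_accuracy` and the sample record are hypothetical and are not taken from Pipeline Score itself.

```python
def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches.

    Assumption: values are compared as case-insensitive, whitespace-trimmed
    strings. The real benchmark may use a different matching rule.
    """
    if not expected:
        raise ValueError("expected must contain at least one field")

    def norm(value) -> str:
        return str(value).strip().casefold()

    correct = sum(
        1
        for key, want in expected.items()
        if key in predicted and norm(predicted[key]) == norm(want)
    )
    return correct / len(expected)


# Hypothetical gold record and agent output, for illustration only.
expected = {"name": "Ada Lovelace", "year": 1843, "city": "London"}
predicted = {"name": "ada lovelace ", "year": "1843", "city": "Paris"}

score = field_accuracy(predicted, expected)
print(f"field accuracy: {score:.0%}")
```

Here two of the three fields match, so the record scores about 67%. Missing fields count as wrong, which keeps an agent from improving its score by omitting uncertain values.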

Code Generation

Tier 1

Write a working Python function from a natural-language spec. Scored on test pass rate.

Avg Team: 86% · Avg Solo: 82%
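A test pass rate can be computed by running the generated function against a list of input/output cases and counting the ones it gets right. This is a minimal sketch under that assumption; `pass_rate`, `safe_div`, and the test cases are hypothetical stand-ins, not the benchmark's actual harness.

```python
from typing import Any, Callable, Sequence, Tuple

Case = Tuple[tuple, Any]  # (positional arguments, expected return value)


def pass_rate(func: Callable[..., Any], cases: Sequence[Case]) -> float:
    """Fraction of cases where func(*args) returns the expected value.

    A case that raises an exception is counted as a failure.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    passed = 0
    for args, want in cases:
        try:
            if func(*args) == want:
                passed += 1
        except Exception:
            pass  # a crash is treated the same as a wrong answer
    return passed / len(cases)


# Hypothetical generated function: it ignores the spec's requirement
# to return None when dividing by zero.
def safe_div(a: float, b: float):
    return a / b


cases = [((6, 3), 2.0), ((1, 0), None), ((9, 3), 3.0)]
rate = pass_rate(safe_div, cases)
print(f"pass rate: {rate:.0%}")
```

The divide-by-zero case raises, so the function passes two of three tests. A real harness would also run untrusted generated code in a sandbox with a timeout, which this sketch leaves out.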

Logical Reasoning

Tier 1

Solve a multi-step reasoning problem with a verifiable answer. Scored on correctness.

Avg Team: 85% · Avg Solo: 81%

Research Synthesis

Tier 2

Produce a 600-word brief from 5 source documents. Scored on accuracy, coverage, and conciseness.

Avg Team: 84% · Avg Solo: 70%

Multi-Tool Orchestration

Tier 2

Complete a task requiring 3+ tool calls in sequence. Scored on completion rate and tool accuracy.

Avg Team: 81% · Avg Solo: 63%

Bug Diagnosis

Tier 2

Identify and fix 3 bugs in a 200-line codebase. Scored on fix correctness.

Avg Team: 83% · Avg Solo: 74%

Document Review

Tier 3

Flag risks in a 40-page contract. Scored on recall against expert annotations.

Avg Team: 79% · Avg Solo: 56%
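Recall against expert annotations is the fraction of expert-flagged risks that the agent also flags. The sketch below assumes each risk can be identified by a clause number; the function name `recall` and the clause IDs are hypothetical, and the benchmark may match risks in some other way.

```python
def recall(flagged: set[str], expert: set[str]) -> float:
    """Share of expert-annotated items that also appear in the agent's flags."""
    if not expert:
        raise ValueError("expert annotations must not be empty")
    return len(flagged & expert) / len(expert)


# Hypothetical clause IDs, for illustration only.
expert_flags = {"4.2", "7.1", "9.3", "12.5"}
agent_flags = {"4.2", "9.3", "15.0"}

contract_recall = recall(agent_flags, expert_flags)
print(f"recall: {contract_recall:.0%}")
```

The agent finds two of the four expert-flagged clauses, so recall is 50%. The extra flag on clause 15.0 does not lower the score, because recall ignores false positives; precision would be needed to penalize over-flagging.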

Real-Time Research

Tier 3

Answer a question requiring live web data and synthesis. Scored on accuracy and recency.

Avg Team: 76% · Avg Solo: 60%

Adversarial Reasoning

Tier 3

Construct and defend a position under counter-argument. Scored on coherence and factual grounding.

Avg Team: 74% · Avg Solo: 53%
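The averages listed above can be summarized as a team-over-solo gap per tier. The sketch below copies the nine published averages and takes an unweighted mean of the per-task gaps; `mean_gap_by_tier` is a hypothetical helper, and the per-run data behind these averages is not available here, so this is arithmetic on the headline figures only.

```python
from collections import defaultdict

# (task, tier, avg team %, avg solo %) as listed on this page.
RESULTS = [
    ("Structured Extraction", 1, 88, 76),
    ("Code Generation", 1, 86, 82),
    ("Logical Reasoning", 1, 85, 81),
    ("Research Synthesis", 2, 84, 70),
    ("Multi-Tool Orchestration", 2, 81, 63),
    ("Bug Diagnosis", 2, 83, 74),
    ("Document Review", 3, 79, 56),
    ("Real-Time Research", 3, 76, 60),
    ("Adversarial Reasoning", 3, 74, 53),
]


def mean_gap_by_tier(results):
    """Unweighted mean of (team - solo) percentage points for each tier."""
    gaps = defaultdict(list)
    for _task, tier, team, solo in results:
        gaps[tier].append(team - solo)
    return {tier: sum(vals) / len(vals) for tier, vals in sorted(gaps.items())}


gaps = mean_gap_by_tier(RESULTS)
for tier, gap in gaps.items():
    print(f"Tier {tier}: team leads solo by {gap:.1f} points on average")
```

On these figures the average gap is 6.7 points for Tier 1, 13.7 for Tier 2, and 20.0 for Tier 3, so the team advantage widens as the tier number rises.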

Download Pipeline Score

Run the benchmark on your own agent stack. It is free and open source, and a run takes about 15 minutes.

Requirements: Python 3.10+ · Ollama or API key · 15 min runtime