The Agent Team Benchmark
Nine standardized test types measure the capabilities of agent teams against solo models across increasingly complex challenges.
Data Extraction: Extract structured data from a 500-word unstructured document. Scored on field accuracy.
Code Generation: Write a working Python function from a natural-language spec. Scored on test pass rate.
Multi-Step Reasoning: Solve a multi-step reasoning problem with a verifiable answer. Scored on correctness.
Summarization: Produce a 600-word brief from 5 source documents. Scored on accuracy, coverage, and conciseness.
Tool Use: Complete a task requiring 3+ tool calls in sequence. Scored on completion rate and tool accuracy.
Debugging: Identify and fix 3 bugs in a 200-line codebase. Scored on fix correctness.
Contract Review: Flag risks in a 40-page contract. Scored on recall against expert annotation.
Web Research: Answer a question requiring live web data and synthesis. Scored on accuracy and recency.
Structured Debate: Construct and defend a position under counter-argument. Scored on coherence and factual grounding.
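To make the scoring concrete, here is a minimal sketch of how the first metric, field accuracy, could be computed: the fraction of reference fields whose extracted value matches exactly. The function name and the exact-match rule are assumptions for illustration; the benchmark may use normalized or fuzzy matching in practice.

```python
from typing import Any

def field_accuracy(predicted: dict[str, Any], gold: dict[str, Any]) -> float:
    """Fraction of gold-standard fields the agent extracted with an exactly matching value."""
    if not gold:
        return 1.0  # nothing to extract, trivially perfect
    correct = sum(1 for field, value in gold.items() if predicted.get(field) == value)
    return correct / len(gold)
```

For example, an extraction that gets one of two reference fields right scores 0.5: `field_accuracy({"name": "Ada", "year": 1843}, {"name": "Ada", "year": 1842})` returns `0.5`.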
Run the benchmark on your own agent stack. Free, open source, takes ~15 minutes.
Requirements: Python 3.10+ · Ollama or API key · 15 min runtime