The Agent Team Benchmark
Nine standardized test types measure the capabilities of agent teams against solo models across increasingly complex challenges.
Data Extraction: Extract structured data from a 500-word unstructured document. Scored on field accuracy.
Code Generation: Write a working Python function from a natural-language spec. Scored on test pass rate.
Multi-Step Reasoning: Solve a multi-step reasoning problem with a verifiable answer. Scored on correctness.
Summarization: Produce a 600-word brief from 5 source documents. Scored on accuracy, coverage, and conciseness.
Tool Use: Complete a task requiring 3+ tool calls in sequence. Scored on completion rate and tool accuracy.
Debugging: Identify and fix 3 bugs in a 200-line codebase. Scored on fix correctness.
Contract Review: Flag risks in a 40-page contract. Scored on recall against expert annotation.
Web Research: Answer a question requiring live web data and synthesis. Scored on accuracy and recency.
Structured Debate: Construct and defend a position under counter-argument. Scored on coherence and factual grounding.
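To make the scoring concrete, here is a minimal sketch of how the first metric, field accuracy, could be computed: the fraction of reference fields whose extracted value matches exactly. The function name and the exact-match rule are assumptions for illustration; the benchmark may use normalized or fuzzy matching in practice.

```python
from typing import Any

def field_accuracy(predicted: dict[str, Any], gold: dict[str, Any]) -> float:
    """Fraction of gold-standard fields the agent extracted with an exactly matching value."""
    if not gold:
        return 1.0  # nothing to extract, trivially perfect
    correct = sum(1 for field, value in gold.items() if predicted.get(field) == value)
    return correct / len(gold)
```

For example, an extraction that gets one of two reference fields right scores 0.5: `field_accuracy({"name": "Ada", "year": 1843}, {"name": "Ada", "year": 1842})` returns `0.5`.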
Run the benchmark on your own agent stack. Free, open source, takes ~15 minutes.
Requirements: Python 3.10+ · Ollama or API key · 15 min runtime