DeepResearch Environment for Prime Environments
I shipped a production-ready DeepResearch environment to the Prime Environments open-source evaluation suite. The submission won the official bounty and is now the reference environment for research-style tasks that blend tool use, web reasoning, and code execution.
Why it Matters
- Extends Prime Environments beyond pure coding puzzles into multi-step, tool-driven research workflows
- Establishes task-aware scoring so short-form QA and long-form synthesis are graded fairly
- Demonstrates competitive agent performance against the new rubric, unlocking higher difficulty tiers for the benchmark
Core Contributions
- Task-aware rubric system with binary accuracy for short-form tasks and weighted factuality/writing scores for long responses (see the scoring sketch after this list)
- Three integrated tools exposed through the environment API: Exa-powered web search, markdown-based page browsing, and a sandboxed Python interpreter
- Dataset variants for demo, short-form, long-form, and tool stress tests so agent builders can target specific capabilities
- Judge prompt redesign that eliminates trivially perfect scores and yields a realistic spread of rewards (e.g. 0.0, 0.94, 1.0)
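To make the task-aware routing concrete, here is a minimal sketch of how this kind of scoring can be wired up. The function names, judge callables, and the 70/30 weighting are illustrative assumptions, not the code from PR #205.

```python
# Sketch of task-aware scoring: binary accuracy for short-form QA, weighted
# judge scores for long-form synthesis. Names and weights are assumptions.

def score_response(
    task_type: str,
    prediction: str,
    reference: str,
    judge_factuality=None,
    judge_writing=None,
) -> float:
    """Route a finished rollout to the grading rule for its task type."""
    if task_type == "short_form":
        # Binary accuracy: the answer either matches the reference or it doesn't.
        return float(prediction.strip().lower() == reference.strip().lower())

    # Long-form synthesis: weighted blend of judged factuality and writing quality.
    factuality = judge_factuality(prediction, reference)  # expected in [0, 1]
    writing = judge_writing(prediction)                    # expected in [0, 1]
    return 0.7 * factuality + 0.3 * writing                # assumed 70/30 weighting
```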
Usage Highlights
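A hedged example of how the environment could be loaded for evaluation, assuming it is published under the id `deepresearch` and follows the standard verifiers `load_environment` convention; the id, keyword argument, and variant names are illustrative, not confirmed API.

```python
# Illustrative only: assumes a "deepresearch" environment id and the standard
# verifiers entry point; the dataset-variant option shown is an assumption.
import verifiers as vf

# Load the default (demo) variant; the short-form, long-form, and tool-stress
# variants would be selected the same way if exposed as load-time options, e.g.
#   vf.load_environment("deepresearch", dataset_variant="short_form")
env = vf.load_environment("deepresearch")

# The agent-facing tools (Exa-powered web search, markdown page browsing, and
# the sandboxed Python interpreter) are exposed through the environment's tool
# API and are invoked by the policy during rollouts rather than called directly.
```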
Bounty Outcome
- ✅ Prime Environments bounty winner for expanding the ecosystem with a deep research setting
- ✅ Adopted as an official environment for future Prime Intellect evaluations
- ✅ Recognized for raising the difficulty bar while keeping scoring reproducible
The full implementation lives in PR #205 with extensive documentation and tests covering the new agents, tools, and reward models.
