DeepResearch Environment for Prime Environments
I shipped a production-ready DeepResearch environment to the Prime Environments open-source evaluation suite. The submission won the official bounty and is now the reference environment for research-style tasks that blend tool use, web reasoning, and code execution.
Why it Matters
- Extends Prime Environments beyond pure coding puzzles into multi-step, tool-driven research workflows
- Establishes task-aware scoring so short-form QA and long-form synthesis are graded fairly
- Demonstrates competitive agent performance against the new rubric, unlocking higher difficulty tiers for the benchmark
Core Contributions
- Task-aware rubric system with binary accuracy for short-form tasks and weighted factuality/writing scores for long responses (see the scoring sketch after this list)
- Three integrated tools exposed through the environment API: Exa-powered web search, markdown-based page browsing, and a sandboxed Python interpreter
- Dataset variants for demo, short-form, long-form, and tool stress tests so agent builders can target specific capabilities
- Judge prompt redesign that eliminates trivially perfect scores and yields a realistic spread of rewards (e.g. 0.0, 0.94, 1.0)
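To make the task-aware routing concrete, here is a minimal sketch of how this kind of scoring can be wired up. The function names, judge callables, and the 70/30 weighting are illustrative assumptions, not the code from PR #205.

```python
# Sketch of task-aware scoring: binary accuracy for short-form QA, weighted
# judge scores for long-form synthesis. Names and weights are assumptions.

def score_response(
    task_type: str,
    prediction: str,
    reference: str,
    judge_factuality=None,
    judge_writing=None,
) -> float:
    """Route a finished rollout to the grading rule for its task type."""
    if task_type == "short_form":
        # Binary accuracy: the answer either matches the reference or it doesn't.
        return float(prediction.strip().lower() == reference.strip().lower())

    # Long-form synthesis: weighted blend of judged factuality and writing quality.
    factuality = judge_factuality(prediction, reference)  # expected in [0, 1]
    writing = judge_writing(prediction)                    # expected in [0, 1]
    return 0.7 * factuality + 0.3 * writing                # assumed 70/30 weighting
```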
Usage Highlights
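A hedged example of how the environment could be loaded for evaluation, assuming it is published under the id `deepresearch` and follows the standard verifiers `load_environment` convention; the id, keyword argument, and variant names are illustrative, not confirmed API.

```python
# Illustrative only: assumes a "deepresearch" environment id and the standard
# verifiers entry point; the dataset-variant option shown is an assumption.
import verifiers as vf

# Load the default (demo) variant; the short-form, long-form, and tool-stress
# variants would be selected the same way if exposed as load-time options, e.g.
#   vf.load_environment("deepresearch", dataset_variant="short_form")
env = vf.load_environment("deepresearch")

# The agent-facing tools (Exa-powered web search, markdown page browsing, and
# the sandboxed Python interpreter) are exposed through the environment's tool
# API and are invoked by the policy during rollouts rather than called directly.
```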
Bounty Outcome
- ✅ Prime Environments bounty winner for expanding the ecosystem with a deep research setting
- ✅ Adopted as an official environment for future Prime Intellect evaluations
- ✅ Recognized for raising the difficulty bar while keeping scoring reproducible
The full implementation lives in PR #205 with extensive documentation and tests covering the new agents, tools, and reward models.
