Prime Environments • DeepResearch Agent Suite

Multi-tool RL evaluation environment with task-aware LLM rubrics

Currently Building
10/15/2025
2 min read
ai-research
Reinforcement Learning
Agent Evaluation
Tool Use
LLM
Prime Environments

DeepResearch Environment for Prime Environments

I shipped a production-ready DeepResearch environment to the Prime Environments open-source evaluation suite. The work won the official bounty and is now the reference environment for research-style tasks that blend tool use, web reasoning, and code execution.

Why it Matters

  • Extends Prime Environments beyond pure coding puzzles into multi-tool research workflows
  • Establishes task-aware scoring so short-form QA and long-form synthesis are graded fairly
  • Demonstrates competitive agent performance under the new rubric, unlocking higher difficulty tiers for the benchmark

Core Contributions

  • Task-aware rubric system with binary accuracy for short-form tasks and weighted factuality/writing scores for long responses (sketched after this list)
  • Three integrated tools exposed through the environment API: Exa-powered web search, markdown-based page browsing, and a sandboxed Python interpreter
  • Dataset variants for demo, short-form, long-form, and tool stress tests so agent builders can target specific capabilities
  • Judge prompt redesign that eliminates trivial perfect scores and surfaces realistic reward distributions (0.0, 0.94, 1.0)
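
To make the task-aware scoring concrete, here is a minimal sketch of how a rubric like this could route rewards by task type. Everything in it is illustrative: the `Judgment` fields, the 0.7/0.3 weighting, and the function names are assumptions for this post, not the implementation that shipped in PR #205.

```python
# Illustrative sketch only; names and weights are hypothetical.
from dataclasses import dataclass


@dataclass
class Judgment:
    """Scores returned by the LLM judge for a single rollout."""
    exact_match: bool = False  # short-form: does the answer match the reference?
    factuality: float = 0.0    # long-form: 0-1 judge score for factual accuracy
    writing: float = 0.0       # long-form: 0-1 judge score for structure and clarity


def task_aware_reward(task_type: str, judgment: Judgment) -> float:
    """Route scoring by task type: binary accuracy for short-form QA,
    a weighted factuality/writing blend for long-form synthesis."""
    if task_type == "short_form":
        return 1.0 if judgment.exact_match else 0.0
    if task_type == "long_form":
        # Example weighting chosen for illustration; the real split is not quoted here.
        return 0.7 * judgment.factuality + 0.3 * judgment.writing
    raise ValueError(f"unknown task type: {task_type}")


# Under this example weighting, a long-form answer judged 1.0 on factuality and
# 0.8 on writing rounds to 0.94, one of the non-trivial rewards quoted above.
print(round(task_aware_reward("long_form", Judgment(factuality=1.0, writing=0.8)), 2))
```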

Usage Highlights
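
For a sense of how the environment is meant to be driven, the sketch below loads it through the `verifiers` library that Prime Environments builds on and scores a handful of rollouts. Treat it as a sketch under assumptions: the environment id, the dataset-variant argument, the model name, and the exact `evaluate` call may differ from what shipped in the PR.

```python
# Hedged usage sketch: environment id, variant name, and argument names are
# assumptions; the real entry points live in PR #205.
import verifiers as vf
from openai import OpenAI

# Load the deep research environment; a dataset variant would select the
# demo, short-form, long-form, or tool stress-test split.
env = vf.load_environment("deepresearch", dataset="short_form")

# Run a few rollouts against an OpenAI-compatible model and inspect rewards.
client = OpenAI()
results = env.evaluate(client, model="gpt-4.1-mini", num_examples=5)
print(results.reward)  # expect a spread like 0.0 / 0.94 / 1.0 rather than uniform 1.0s
```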

Bounty Outcome

  • ✅ Prime Environments bounty winner for expanding the ecosystem with a deep research setting
  • ✅ Adopted as an official environment for future Prime Intellect evaluations
  • ✅ Recognized for raising the difficulty bar while keeping reproducible scoring

The full implementation lives in PR #205 with extensive documentation and tests covering the new agents, tools, and reward models.