Arbor: Autonomous Research Framework Unifies Long-Horizon Optimization Across Domains
Key Takeaways
- Arbor introduces Hypothesis Tree Refinement (HTR), a persistent tree data structure that tracks hypotheses, artifacts, evidence, and distilled insights over time, turning autonomous research from an episodic process into a cumulative one
- The framework works across fundamentally different research domains without task-specific tuning by treating them all as instances of a single operational setting, Autonomous Optimization
- On six real research tasks, Arbor achieved 2.5x the average relative improvement of the Codex and Claude Code baselines, along with state-of-the-art results on MLE-Bench Lite
- The open-source release includes both a standalone CLI for long-running experiments and an Agent Skill Suite for integration with systems like Claude Code, broadening access to structured autonomous research
Summary
Researchers at NLPIR Lab have introduced Arbor, a general-purpose framework for autonomous scientific research. It combines strategic coordination, isolated hypothesis testing, and a persistent knowledge tree so that optimization becomes cumulative rather than a series of disconnected trial-and-error attempts. The framework addresses a fundamental challenge for autonomous agents: how to conduct long-horizon research that learns from prior experiments and carries those lessons into later iterations. Arbor unifies diverse research tasks, including model training, harness engineering, and data synthesis, under a single Autonomous Optimization setting. It achieves 2.5x the average relative improvement of existing baselines (Codex and Claude Code) and reaches 86.36% on MLE-Bench Lite with GPT-5.5. The team has open-sourced the implementation as a fully runnable CLI and Agent Skill Suite, which allows integration with existing coding agents.
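To make the idea of a persistent hypothesis tree more concrete, here is a minimal Python sketch of what such a structure might look like. It is not Arbor's actual implementation or API: the class `HypothesisNode`, its fields, and the helper `inherited_insights` are hypothetical names chosen for illustration, and JSON is assumed as the persistence format only because it is simple.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class HypothesisNode:
    """One node of an illustrative hypothesis tree (hypothetical, not Arbor's API)."""

    hypothesis: str
    artifacts: list[str] = field(default_factory=list)   # e.g. paths to code or configs
    evidence: list[dict] = field(default_factory=list)   # measured results
    insights: list[str] = field(default_factory=list)    # distilled lessons
    children: list[HypothesisNode] = field(default_factory=list)

    def refine(self, hypothesis: str) -> HypothesisNode:
        """Attach a more specific child hypothesis and return it."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child

    def record(self, metric: str, value: float, insight: Optional[str] = None) -> None:
        """Store one measurement and, optionally, the lesson drawn from it."""
        self.evidence.append({"metric": metric, "value": value})
        if insight:
            self.insights.append(insight)

    def to_json(self) -> str:
        """Serialize the whole subtree so it can outlive a single session."""
        return json.dumps(asdict(self))

    @classmethod
    def from_dict(cls, data: dict) -> HypothesisNode:
        """Rebuild a subtree from the dictionary form produced by to_json."""
        children = [cls.from_dict(c) for c in data.get("children", [])]
        rest = {k: v for k, v in data.items() if k != "children"}
        return cls(**rest, children=children)


def inherited_insights(root: HypothesisNode, target: HypothesisNode) -> Optional[list[str]]:
    """Collect insights along the path from root to target, or None if target is absent."""
    if root is target:
        return list(root.insights)
    for child in root.children:
        found = inherited_insights(child, target)
        if found is not None:
            return root.insights + found
    return None


if __name__ == "__main__":
    root = HypothesisNode("Improve validation accuracy")
    root.record("accuracy", 0.71, insight="Baseline overfits after a few epochs")
    child = root.refine("Add stronger regularization")
    child.record("accuracy", 0.74, insight="Dropout helps more than weight decay")
    restored = HypothesisNode.from_dict(json.loads(root.to_json()))
    print(inherited_insights(root, child))
    print(restored == root)
```

The point of the sketch is the contrast the article draws: because each node keeps its evidence and insights, and the tree can be saved and reloaded, a later experiment can start from the lessons on its ancestors' path instead of from scratch.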
Editorial Opinion
Arbor represents a meaningful shift in autonomous agent design, moving from reactive, trial-and-error systems toward agents that accumulate knowledge through structured exploration. The finding that a persistent knowledge tree and disciplined hypothesis management can deliver 2.5x the average relative improvement of existing baselines suggests that how exploration is scaffolded may matter as much as raw model capability. The framework's ability to unify diverse research tasks while remaining deployable in real codebases shows a rare balance between research sophistication and practical utility. Open-sourcing this work could accelerate research cycles across the industry, especially for teams with limited computational budgets.



