UC Berkeley's DocETL Brings Declarative LLM-Powered Data Processing to VLDB 2025
Key Takeaways
- DocETL makes LLM-powered data processing more accessible by allowing users to define operations in natural language rather than writing code
- Automatic optimization reduces computational costs and improves accuracy by intelligently selecting models, rewriting prompts, and substituting code where beneficial
- The framework supports multiple LLM providers (OpenAI, Anthropic, etc.) and includes both a Python API and a low-code, declarative YAML syntax
- Released as open source, with an interactive playground (the DocWrangler UI) and a Claude Code integration for quick pipeline development
Summary
Researchers at UC Berkeley's EPIC Data Lab have published DocETL, an open-source framework that simplifies complex data processing pipelines built on large language models. Rather than writing individual LLM calls and manually optimizing them, users declare operations in natural language, such as "pull out every complaint in this ticket," and DocETL executes them using map-reduce operators and automatic parallelization. The framework also tunes accuracy, cost, and latency automatically by swapping models, rewriting prompts, and replacing LLM subtasks with code where appropriate. The DocETL paper appears at VLDB 2025, one of the top database systems conferences. A companion paper on DocWrangler, an interactive UI for visual pipeline development, received a Best Paper Honorable Mention at UIST 2025.
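To make the declarative map-reduce idea concrete, here is a toy sketch in plain Python. It is not DocETL's actual API or YAML schema: the spec format and the names `PIPELINE`, `stub_llm`, and `run_pipeline` are hypothetical, and the model call is replaced by an offline keyword matcher so the example runs without any provider. The point is only that the pipeline is described as data, while a generic executor decides how to run it (here, the map step is fanned out across threads).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical spec format: operations are described as data, not as code.
PIPELINE = [
    {"type": "map",
     "prompt": "Pull out every complaint in this ticket:\n{text}",
     "output": "complaints"},
    {"type": "reduce", "group_by": "product", "collect": "complaints"},
]

TICKETS = [
    {"product": "router", "text": "The router arrived broken. Support was friendly."},
    {"product": "router", "text": "Delivery was late. Setup was easy."},
    {"product": "cable", "text": "Works fine."},
]


def stub_llm(prompt: str) -> list[str]:
    """Offline stand-in for a model call (assumption: a real system would
    send the prompt to an LLM). Keeps sentences mentioning 'broken' or 'late'."""
    ticket = prompt.split("\n", 1)[1]  # text after the instruction line
    sentences = [s.strip() for s in ticket.split(".") if s.strip()]
    return [s for s in sentences if "broken" in s or "late" in s]


def run_pipeline(docs: list[dict], spec: list[dict],
                 llm: Callable[[str], list[str]]) -> list[dict]:
    """Run each declared operation in order over the documents."""
    for op in spec:
        if op["type"] == "map":
            # Render one prompt per document and call the model in parallel.
            prompts = [op["prompt"].format(**doc) for doc in docs]
            with ThreadPoolExecutor(max_workers=4) as pool:
                results = list(pool.map(llm, prompts))  # order is preserved
            docs = [{**doc, op["output"]: res} for doc, res in zip(docs, results)]
        elif op["type"] == "reduce":
            # Code-only reduce: concatenate the collected lists per group key.
            groups: dict[str, list[str]] = defaultdict(list)
            for doc in docs:
                groups[doc[op["group_by"]]].extend(doc[op["collect"]])
            docs = [{op["group_by"]: key, op["collect"]: items}
                    for key, items in groups.items()]
        else:
            raise ValueError(f"unknown operation type: {op['type']}")
    return docs


result = run_pipeline(TICKETS, PIPELINE, stub_llm)
```

Because the executor only sees the spec, an optimizer could in principle rewrite a prompt, pick a different model for one step, or swap an LLM step for code without the user changing the pipeline description, which is the kind of tuning the summary attributes to DocETL.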
Editorial Opinion
DocETL addresses a real pain point in the LLM application space: the gap between simple one-off LLM calls and production-grade data pipelines. By combining declarative syntax with automatic optimization, it could significantly lower the barrier to entry for building sophisticated data processing workflows. The dual publications at VLDB and UIST suggest the authors have thought deeply about both the systems architecture and the user experience, which is rare in academic work.



