UC Berkeley's DocETL Brings Declarative LLM-Powered Data Processing to VLDB 2025

Key Takeaways

▸DocETL makes LLM-powered data processing more accessible by allowing users to define operations in natural language rather than writing code
▸Automatic optimization reduces computational costs and improves accuracy by intelligently selecting models, rewriting prompts, and substituting code where beneficial
▸The framework supports multiple LLM providers (OpenAI, Anthropic, etc.) and offers both a Python API and a low-code, declarative YAML syntax
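
To make the cost and accuracy trade-off in the second takeaway concrete, here is a minimal Python sketch of one way an optimizer could choose among candidate plans. It is not DocETL's actual optimizer or API: the `Plan` class, the `choose_plan` function, and all cost and accuracy figures are hypothetical, and the selection rule (the cheapest plan that meets an accuracy target) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate way to run one operation."""
    name: str        # e.g. a larger model, a smaller model, or plain code
    cost: float      # assumed cost per 1,000 documents (illustrative units)
    accuracy: float  # assumed accuracy measured on a small labeled sample


def choose_plan(plans: list[Plan], min_accuracy: float) -> Plan:
    """Return the cheapest plan meeting the accuracy target.

    If no plan meets the target, fall back to the most accurate one.
    """
    if not plans:
        raise ValueError("at least one candidate plan is required")
    good_enough = [p for p in plans if p.accuracy >= min_accuracy]
    if good_enough:
        return min(good_enough, key=lambda p: p.cost)
    return max(plans, key=lambda p: p.accuracy)


if __name__ == "__main__":
    # All numbers below are made up for the example.
    candidates = [
        Plan("large-model", cost=10.0, accuracy=0.95),
        Plan("small-model", cost=1.0, accuracy=0.90),
        Plan("code-rewrite", cost=0.1, accuracy=0.70),
    ]
    print(choose_plan(candidates, min_accuracy=0.85).name)  # small-model
```

With an 0.85 target, the sketch skips the cheapest plan because it is not accurate enough and skips the most accurate plan because a cheaper one already qualifies.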

Source: Hacker News, https://github.com/ucbepic/docetl

Summary

Researchers at UC Berkeley's EPIC Data Lab have published DocETL, an open-source framework that simplifies complex data processing pipelines using large language models. Rather than writing individual LLM calls and manually optimizing them, users can declare operations in natural language—such as "pull out every complaint in this ticket"—and DocETL handles the heavy lifting with map-reduce operators, automatic parallelization, and intelligent optimization. The framework automatically tunes accuracy, cost, and latency by swapping models, rewriting prompts, and replacing LLM subtasks with code where appropriate. The paper was published at VLDB 2025, one of the top database systems conferences, alongside a companion research paper on DocWrangler (Best Paper Honorable Mention at UIST 2025), an interactive UI for visual pipeline development.

DocETL is distributed as open source, with an interactive playground (the DocWrangler UI) and a Claude Code integration for quick pipeline development.
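
The declarative map-reduce style described in this summary can be sketched in a few lines of Python. This is an illustration of the idea only, not DocETL's API: `Op`, `run_pipeline`, and `fake_llm` are hypothetical names, and `fake_llm` is a keyword-matching stand-in for a real model call so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# An "LLM" here is any function taking (prompt, text) and returning text.
LLM = Callable[[str, str], str]


@dataclass(frozen=True)
class Op:
    """One declared step: an operator kind plus a natural-language prompt."""
    kind: str    # "map" or "reduce"
    prompt: str


def fake_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for a model call, using simple keyword rules."""
    if prompt.startswith("pull out"):
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return "; ".join(s for s in sentences if "broken" in s or "late" in s)
    # Any other prompt is treated as "count the extracted items".
    items = [p for line in text.splitlines() for p in line.split("; ") if p]
    return f"{len(items)} complaints"


def run_pipeline(docs: list[str], ops: list[Op], llm: LLM,
                 max_workers: int = 4) -> list[str]:
    """Apply each declared operation in order to the current data."""
    data = list(docs)
    for op in ops:
        if op.kind == "map":
            # Map steps are independent per document, so run them in parallel.
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                data = list(pool.map(lambda d: llm(op.prompt, d), data))
        elif op.kind == "reduce":
            # Reduce collapses all intermediate results into one output.
            data = [llm(op.prompt, "\n".join(data))]
        else:
            raise ValueError(f"unknown operation kind: {op.kind}")
    return data


if __name__ == "__main__":
    tickets = [
        "The app is broken. Thanks for the help.",
        "Delivery was late. The box was broken.",
    ]
    ops = [
        Op("map", "pull out every complaint in this ticket"),
        Op("reduce", "summarize how many issues were raised"),
    ]
    print(run_pipeline(tickets, ops, fake_llm))  # ['3 complaints']
```

The point of the design is that the pipeline is data (a list of `Op` values) rather than hand-written call sites, which is what leaves room for a system to parallelize or rewrite steps without the user changing the declaration.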

Editorial Opinion

DocETL addresses a real pain point in the LLM application space—the gap between simple one-off LLM calls and production-grade data pipelines. By combining declarative syntax with automatic optimization, it could significantly lower the barrier to entry for building sophisticated data processing workflows. The dual publications at VLDB and UIST suggest the authors have thought deeply about both the systems architecture and the user experience, which is rare in academic work.

RESEARCH | UC Berkeley | 2026-07-02