BEAVER: MIT Releases Large-Scale Enterprise Benchmark for LLM Text-to-SQL Systems
Key Takeaways
- BEAVER contains 9,128 real-world text-to-SQL queries over 812 tables across 19 enterprise domains
- Detailed subtask annotations enable fine-grained analysis of LLM failures in multi-table retrieval, join key detection, column mapping, domain knowledge extraction, and query decomposition
- Because the dataset is sourced from private organizations, it reflects real enterprise data warehouse scenarios
Summary
Researchers at MIT and collaborating institutions have released BEAVER, an enterprise benchmark for evaluating text-to-SQL systems on private data warehouses. The benchmark comprises 9,128 queries over 812 tables spanning 19 domains; 7,978 queries are publicly released, and a private test set is held out for evaluation. Queries and databases were sourced from real private organizations, which keeps the benchmark relevant to enterprise use cases.
BEAVER addresses a gap in LLM evaluation by providing detailed subtask annotations that break text-to-SQL down into five evaluation areas: multi-table retrieval, join key detection, column mapping, domain knowledge extraction, and query decomposition. The dataset further sorts queries into three categories: complex queries that need no domain knowledge, domain-specific queries with minimal complexity, and queries that are both domain-specific and complex. This annotation scheme lets researchers identify the specific steps at which LLMs fail on real-world SQL generation tasks, rather than relying on a single end-to-end score.
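As a rough illustration of how subtask annotations support this kind of failure analysis, the sketch below aggregates per-subtask accuracy from a list of evaluation records. The record layout, the function name `subtask_accuracy`, the subtask identifiers, and the toy results are all hypothetical; they are not BEAVER's actual file format or tooling, and the numbers are made up.

```python
from collections import defaultdict

# Hypothetical identifiers for the five subtask areas named above.
SUBTASKS = (
    "multi_table_retrieval",
    "join_key_detection",
    "column_mapping",
    "domain_knowledge",
    "query_decomposition",
)


def subtask_accuracy(results):
    """Return {subtask: fraction of annotated queries the model got right}.

    `results` is an iterable of dicts such as
    {"query_id": "q1", "subtasks": {"column_mapping": True, ...}}.
    A query counts toward a subtask only when it carries that annotation.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in results:
        for name, passed in record["subtasks"].items():
            total[name] += 1
            correct[name] += int(bool(passed))
    # Report only known subtasks that appear at least once.
    return {name: correct[name] / total[name] for name in SUBTASKS if total[name]}


# Toy, made-up results for three queries (not BEAVER data).
toy_results = [
    {"query_id": "q1", "subtasks": {"multi_table_retrieval": True, "join_key_detection": False}},
    {"query_id": "q2", "subtasks": {"multi_table_retrieval": True, "column_mapping": True}},
    {"query_id": "q3", "subtasks": {"join_key_detection": False, "column_mapping": False}},
]

accuracy = subtask_accuracy(toy_results)
# Print the weakest subtasks first.
for name, score in sorted(accuracy.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.2f}")
```

Sorting by score puts the weakest subtask at the top, which is the kind of breakdown an end-to-end accuracy number alone would hide.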
The benchmark is positioned as a tool both for advancing research and for evaluating enterprise AI systems that must handle complex queries over private data warehouses. Researchers who want their methods evaluated are invited to contact the team with details of their approach and, optionally, links to a paper or code.
Editorial Opinion
BEAVER is an important contribution to LLM evaluation infrastructure at a time when text-to-SQL has become a core capability for enterprise AI systems. By grounding the benchmark in real corporate data and queries rather than synthetic ones, the researchers have created a more meaningful test of practical SQL generation. The granular subtask annotations go beyond simple end-to-end metrics to pinpoint where LLMs fail, which should accelerate progress on enterprise-grade text-to-SQL systems.



