Databricks Scales to 10 Trillion Monitoring Samples Per Day With Custom Infrastructure
Key Takeaways
- Databricks' monitoring infrastructure now ingests 10 trillion samples daily and tracks 5 billion active timeseries across roughly 70 cloud regions
- The company built Pantheon, a custom fork of the open-source Thanos project, because off-the-shelf solutions couldn't handle its scale and complexity requirements
- Infrastructure improvements reduced monitoring downtime by 5x and eliminated millions in annual cloud costs
Summary
Databricks has shared details of its custom monitoring infrastructure platform, which now tracks 5 billion active timeseries in real time and ingests over 10 trillion samples per day, more than triple the scale of a year ago. The company determined that traditional off-the-shelf monitoring solutions were inefficient at its scale, prompting engineers to develop a new platform that builds on the open-source monitoring ecosystem while adding customizations for Databricks' specific needs.
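To put the headline figures in perspective, the short calculation below converts them into per-second and per-series rates. It is a back-of-the-envelope sketch only: it treats "over 10 trillion" and "5 billion" as exact round numbers and assumes samples are spread evenly across series and across the day, neither of which the announcement states.

```python
# Rounded figures from the announcement, treated as exact for illustration.
SAMPLES_PER_DAY = 10 * 10**12   # 10 trillion samples ingested per day
ACTIVE_SERIES = 5 * 10**9       # 5 billion active timeseries
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average ingest rate across the whole fleet (assumes uniform load).
samples_per_second = SAMPLES_PER_DAY / SECONDS_PER_DAY

# Average number of samples each series contributes per day.
samples_per_series_per_day = SAMPLES_PER_DAY / ACTIVE_SERIES

# Implied average gap between samples for a single series.
avg_interval_seconds = SECONDS_PER_DAY / samples_per_series_per_day

print(f"{samples_per_second:,.0f} samples/second")
print(f"{samples_per_series_per_day:,.0f} samples/series/day")
print(f"{avg_interval_seconds:.1f} seconds between samples")
```

Under these assumptions the fleet averages roughly 116 million samples per second, and each series reports about once every 43 seconds. Real traffic is unlikely to be this uniform, so peak rates would be higher.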
The core challenges were managing monitoring across roughly 70 cloud regions spanning AWS, Azure, and Google Cloud while maintaining high reliability and supporting exponential growth driven by serverless and AI workloads. Traditional timeseries databases became a critical bottleneck requiring daily scaling operations. Databricks developed Pantheon, a fork of the CNCF's open-source Thanos project, which now powers 160+ instances globally and handles nearly 1,000 PromQL queries per second on the largest deployments.
The new architecture introduced metric aggregation to manage cardinality explosion caused by rapid infrastructure churn and integrated Databricks' lakehouse for dimensional troubleshooting. The migration to Pantheon reduced monitoring infrastructure downtime by 5x, eliminated significant manual operational toil, and saved millions in annual cloud costs. Databricks has contributed performance optimizations and edge case fixes back to the open-source Thanos community.
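The idea behind metric aggregation can be illustrated with a minimal sketch. This is not Pantheon's implementation, which the announcement does not describe in detail; the function `aggregate_away` and the label names `service`, `region`, and `pod` are hypothetical. The sketch sums series that differ only in a high-churn label, so that short-lived instances stop creating new timeseries.

```python
from collections import defaultdict

def aggregate_away(series, drop_labels):
    """Sum values of series that are identical once drop_labels are removed.

    `series` is an iterable of (labels_dict, value) pairs. Summing suits
    additive metrics such as request counts; other metric types would
    need a different aggregation function.
    """
    totals = defaultdict(float)
    for labels, value in series:
        # Build a hashable key from the labels we keep.
        kept = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in drop_labels
        ))
        totals[kept] += value
    return dict(totals)

# Hypothetical raw series: one per pod, and pods churn constantly.
raw = [
    ({"service": "api", "region": "us-west", "pod": "api-7f9c"}, 12.0),
    ({"service": "api", "region": "us-west", "pod": "api-2b1d"}, 8.0),
    ({"service": "api", "region": "eu-west", "pod": "api-9e44"}, 5.0),
    ({"service": "api", "region": "eu-west", "pod": "api-0a3f"}, 7.0),
]

aggregated = aggregate_away(raw, drop_labels={"pod"})
print(len(raw), "raw series ->", len(aggregated), "aggregated series")
```

In this toy input, four per-pod series collapse into two per-region series. The trade-off is that per-pod detail is no longer available in the timeseries store, which is presumably where a complementary system for dimensional troubleshooting becomes useful.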
Editorial Opinion
Databricks' decision to build custom infrastructure rather than adopt commercial monitoring solutions reflects a critical threshold in the industry: at hyperscale, generic tools inevitably become a liability. More importantly, the company's commitment to contributing improvements back to Thanos demonstrates how even hyperscale companies can be responsible open-source stewards. As AI infrastructure complexity explodes and data workloads reach planetary scale, expect more companies to follow Databricks' playbook: start with open-source foundations and customize ruthlessly.


