BotBeat

RESEARCH | Databricks | 2026-05-05

Databricks Scales to 10 Trillion Monitoring Samples Per Day With Custom Infrastructure

Key Takeaways

  • Databricks' monitoring infrastructure now ingests 10 trillion samples daily and tracks 5 billion active timeseries across 70 cloud regions
  • The company built Pantheon, a custom fork of open-source Thanos TSDB, because off-the-shelf solutions couldn't handle scale and complexity requirements
  • Infrastructure improvements reduced monitoring downtime by 5x and eliminated millions in annual cloud costs
Source: Hacker News, https://www.databricks.com/blog/10-trillion-samples-day-scaling-beyond-traditional-monitoring-infra-databricks

Summary

Databricks has shared details of its custom monitoring platform, which now tracks 5 billion active timeseries in real time and ingests over 10 trillion samples per day, more than triple the scale of a year ago. The company determined that off-the-shelf monitoring solutions were inefficient at its scale, prompting engineers to develop a new platform that builds on the open-source monitoring ecosystem while adding customizations for Databricks' own needs.
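To put the headline figures in perspective, the short calculation below converts them into an average ingestion rate and an average sampling interval per series. It is a back-of-the-envelope sketch only: it treats "over 10 trillion" and "5 billion" as exact, simultaneous values, and the function name `ingest_stats` is made up for illustration. Real load is unlikely to be spread this evenly.

```python
# Figures quoted in the article, treated as exact for a rough estimate.
SAMPLES_PER_DAY = 10 * 10**12   # "over 10 trillion samples per day"
ACTIVE_SERIES = 5 * 10**9       # "5 billion active timeseries"
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_stats(samples_per_day, active_series):
    """Return (samples/second, samples per series per day, mean seconds between samples)."""
    samples_per_second = samples_per_day / SECONDS_PER_DAY
    samples_per_series_per_day = samples_per_day / active_series
    mean_interval_seconds = SECONDS_PER_DAY / samples_per_series_per_day
    return samples_per_second, samples_per_series_per_day, mean_interval_seconds


rate, per_series, interval = ingest_stats(SAMPLES_PER_DAY, ACTIVE_SERIES)
print(f"{rate:,.0f} samples/s on average")
print(f"{per_series:,.0f} samples per series per day")
print(f"one sample every {interval:.1f} s per series on average")
```

Under these assumptions the platform averages roughly 116 million samples per second, and each active series receives about 2,000 samples a day, or one roughly every 43 seconds.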

The core challenges were managing monitoring across roughly 70 cloud regions spanning AWS, Azure, and Google Cloud while maintaining high reliability and supporting exponential growth driven by serverless and AI workloads. Traditional timeseries databases became a critical bottleneck requiring daily scaling operations. Databricks developed Pantheon, a fork of the CNCF's open-source Thanos project, which now powers 160+ instances globally and handles nearly 1,000 PromQL queries per second on the largest deployments.

The new architecture introduced metric aggregation to manage cardinality explosion caused by rapid infrastructure churn and integrated Databricks' lakehouse for dimensional troubleshooting. The migration to Pantheon reduced monitoring infrastructure downtime by 5x, eliminated significant manual operational toil, and saved millions in annual cloud costs. Databricks has contributed performance optimizations and edge case fixes back to the open-source Thanos community.
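The article does not describe how Pantheon implements metric aggregation, so the following is only a generic sketch of the idea: collapsing a high-churn label so that many short-lived series fold into a few stable ones. The label names (`service`, `region`, `pod`), the sample values, and the helper `aggregate_away` are all hypothetical.

```python
from collections import defaultdict


def aggregate_away(samples, drop_labels):
    """Sum sample values across series that differ only in the dropped labels.

    `samples` is an iterable of (labels_dict, value) pairs taken at one timestamp.
    Returns a dict keyed by the remaining labels as a sorted tuple of pairs.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[key] += value
    return dict(totals)


# Three per-pod series; pod names change on every redeploy, which inflates cardinality.
samples = [
    ({"service": "api", "region": "us-east", "pod": "api-7f9c"}, 3.0),
    ({"service": "api", "region": "us-east", "pod": "api-b21d"}, 5.0),
    ({"service": "api", "region": "eu-west", "pod": "api-0aa4"}, 2.0),
]

aggregated = aggregate_away(samples, drop_labels={"pod"})
for key, total in sorted(aggregated.items()):
    print(dict(key), total)
```

Here three per-pod series become two per-region series, and the count no longer grows as pods are replaced. Summation suits additive values such as request counts; other metric types would need a different aggregation function.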


Editorial Opinion

Databricks' decision to build custom infrastructure rather than adopt commercial monitoring solutions reflects a critical threshold in the industry: at hyperscale, generic tools inevitably become a liability. More importantly, the company's commitment to contributing improvements back to Thanos demonstrates how even hyperscale companies can be responsible open-source stewards. As AI infrastructure complexity explodes and data workloads reach planetary scale, expect more companies to follow Databricks' playbook: start with open-source foundations and customize ruthlessly.

Data Science & Analytics | MLOps & Infrastructure | Open Source
