--- Summary: Databricks introduces an open-source Cloud Infra Cost Field Solution to unify Databricks platform costs with cloud infrastructure costs (AWS/Azure/GCP) for complete TCO visibility. The article explains the complexity of measuring Total Cost of Ownership across multicloud environments, where platform costs and cloud infrastructure costs exist in separate data sources (Databricks system tables vs cloud provider billing exports). Key challenges include joining disparate datasets, handling different refresh rates, and parsing high-cardinality tag data.

--- Full Article:

Understanding TCO on Databricks

Understanding the value of your AI and data investments is crucial—yet over 52% of enterprises fail to measure Return on Investment (ROI) rigorously. Complete ROI visibility requires connecting platform usage and cloud infrastructure into a clear financial picture.

On Databricks, customers manage multicloud, multi-workload, and multi-team environments. In these environments, a consistent, comprehensive view of cost is essential for making informed decisions.

At the core of cost visibility on platforms like Databricks is the concept of Total Cost of Ownership (TCO).

On multicloud data platforms like Databricks, TCO consists of two core components:

  • Platform costs: costs incurred through direct usage of Databricks products, such as compute and managed storage.
  • Cloud infrastructure costs: costs incurred through the underlying cloud services needed to support Databricks, such as virtual machines, storage, and networking.

TCO is simpler to understand with serverless products. Because Databricks manages the compute, cloud infrastructure costs are bundled into Databricks costs, giving you centralized cost visibility directly in Databricks system tables (though storage costs still sit with the cloud provider).
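For example, serverless spend can be approximated with a single query over system tables. Below is a minimal sketch, assuming access to the system.billing.usage and system.billing.list_prices tables; the SKU-name filter is a simple heuristic, and list prices do not reflect negotiated discounts:

    # Approximate daily serverless list cost from Databricks system tables.
    daily_serverless_cost = spark.sql("""
        SELECT
          u.usage_date,
          u.sku_name,
          SUM(u.usage_quantity * lp.pricing.effective_list.default) AS list_cost
        FROM system.billing.usage AS u
        JOIN system.billing.list_prices AS lp
          ON u.sku_name = lp.sku_name
         AND u.usage_start_time >= lp.price_start_time
         AND (lp.price_end_time IS NULL OR u.usage_end_time <= lp.price_end_time)
        WHERE u.sku_name LIKE '%SERVERLESS%'   -- heuristic serverless filter
        GROUP BY u.usage_date, u.sku_name
    """)
    daily_serverless_cost.show()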

Understanding TCO for classic compute products, however, is more complex. Here, customers manage compute directly with the cloud provider, so Databricks platform costs and cloud infrastructure costs must be reconciled across two distinct data sources:

  • System tables in Databricks provide operational workload-level metadata and Databricks usage.
  • Cost reports from the cloud provider detail cloud infrastructure costs, including discounts.

Together, these sources form the full TCO view.

The Complexity of TCO

The complexity of measuring your Databricks TCO is compounded by the disparate ways cloud providers expose and report cost data.

Azure Databricks: Leveraging First-Party Billing Data

Because Azure Databricks is a first-party service within the Microsoft Azure ecosystem, Databricks-related charges appear directly in Azure Cost Management alongside other Azure services, even including Databricks-specific tags.

However, Azure Cost Management data will not contain the deeper workload-level metadata and performance metrics found in Databricks system tables. Thus, many organizations seek to bring Azure billing exports into Databricks.

Challenges:

  • Infrastructure must be set up for automated cost exports to ADLS
  • Azure cost data is aggregated and refreshed daily, unlike system tables, which refresh on the order of hours
  • Joining requires parsing high-cardinality Azure tag data and identifying the right join key (e.g., ClusterId), as sketched below
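In practice, once the export is ingested, the tag parsing and join can be expressed in a few lines of PySpark. Below is a minimal sketch, assuming the export lands in a hypothetical azure_cost_export table with Tags and CostInBillingCurrency columns; actual column names vary by export version, and some formats emit the tag JSON without outer braces, needing light preprocessing:

    from pyspark.sql import functions as F

    # Parse the high-cardinality tag payload and extract the ClusterId tag
    # that Databricks stamps on the VMs it provisions.
    infra_by_cluster = (
        spark.table("azure_cost_export")                      # hypothetical table
        .withColumn("tag_map", F.from_json("Tags", "map<string,string>"))
        .withColumn("cluster_id", F.col("tag_map").getItem("ClusterId"))
        .where(F.col("cluster_id").isNotNull())
        .groupBy("cluster_id")
        .agg(F.sum("CostInBillingCurrency").alias("infra_cost"))
    )

    # Databricks-side usage, keyed the same way.
    dbx_by_cluster = (
        spark.table("system.billing.usage")
        .groupBy(F.col("usage_metadata.cluster_id").alias("cluster_id"))
        .agg(F.sum("usage_quantity").alias("dbus"))
    )

    # One row per cluster: Databricks usage next to its Azure infra cost.
    tco_by_cluster = dbx_by_cluster.join(infra_by_cluster, "cluster_id", "left")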

Databricks on AWS: Aligning Marketplace and Infrastructure Costs

On AWS, Databricks costs appear in the Cost and Usage Report (CUR) and in AWS Cost Explorer, but they are represented at a more aggregated SKU level. Moreover, Databricks costs appear in CUR only when Databricks is purchased through the AWS Marketplace; otherwise, CUR reflects only AWS infrastructure costs.

Challenges:

  • Infrastructure must support recurring CUR reprocessing, since AWS refreshes and replaces cost data multiple times per day (with no primary key)
  • AWS cost data spans multiple line item types and cost fields, requiring care to select the correct effective cost for each usage type (On-Demand, Savings Plans, Reserved Instances), as sketched after this list
  • Joining CUR with Databricks metadata requires careful attribution, as cardinality can differ (e.g., a shared all-purpose cluster is represented as a single AWS usage row but can map to multiple jobs in system tables)
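To make the cost-field selection concrete, below is a minimal sketch, assuming the CUR has been ingested into a hypothetical aws_cur table with Athena-style column names; the cluster tag column is an assumption, since cost allocation tag columns depend on your CUR configuration:

    # Pick the effective cost per line item type, attributed by cluster tag.
    aws_effective = spark.sql("""
        SELECT
          DATE(bill_billing_period_start_date) AS billing_period,
          resource_tags_user_cluster_id        AS cluster_id,  -- assumed tag column
          CASE line_item_line_item_type
            WHEN 'SavingsPlanCoveredUsage' THEN savings_plan_savings_plan_effective_cost
            WHEN 'DiscountedUsage'         THEN reservation_effective_cost
            ELSE line_item_unblended_cost        -- On-Demand and other usage
          END AS effective_cost
        FROM aws_cur                             -- hypothetical ingested CUR table
        WHERE line_item_line_item_type IN
              ('Usage', 'SavingsPlanCoveredUsage', 'DiscountedUsage')
    """)

    # AWS replaces each billing period's data wholesale and provides no primary
    # key, so ingestion typically overwrites the affected period in place.
    (aws_effective.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", "billing_period = '2024-06-01'")  # illustrative
        .saveAsTable("aws_infra_costs"))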

Simplifying Databricks TCO Calculations

Common questions teams want to answer:

  • How does the total cost of a serverless job compare with that of a classic job? (See the sketch after this list.)
  • Which clusters, jobs, and warehouses are the biggest consumers of cloud-managed VMs?
  • How do cost trends change as workloads scale, shift, or consolidate?
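As an illustration, once infrastructure costs are joined to Databricks usage, the first question above reduces to a simple aggregation. The sketch below assumes a hypothetical unified table, tco_unified, with illustrative column names; the Field Solution's actual schema may differ:

    # Benchmark total cost per job across serverless and classic compute.
    spark.sql("""
        SELECT
          job_id,
          compute_type,                                  -- e.g., 'SERVERLESS' or 'CLASSIC'
          SUM(databricks_cost)                           AS platform_cost,
          SUM(COALESCE(infra_cost, 0))                   AS cloud_infra_cost,
          SUM(databricks_cost + COALESCE(infra_cost, 0)) AS total_cost
        FROM tco_unified
        GROUP BY job_id, compute_type
        ORDER BY total_cost DESC
    """).show()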

To support this need, Databricks is introducing the Cloud Infra Cost Field Solution, an open-source solution that automates the ingestion and unified analysis of cloud infrastructure costs and Databricks usage data inside the Databricks Platform.

By providing a unified foundation for TCO analysis across Databricks serverless and classic compute environments, the Field Solution helps organizations gain clearer cost visibility and understand architectural trade-offs.