Last quarter, our Databricks bill was $18,400 a month. Today it's closer to $11,200, roughly $7,200 a month less. Same workloads, same data volumes, same team.

This isn’t a miracle story or a complete platform overhaul. We made seven specific configuration changes based on Databricks’ own documentation and cost optimization guides. Some worked better than expected. One barely made a difference. Here’s exactly what we did, with real numbers and honest assessments.

Our Starting Point: The Baseline

Before jumping into optimizations, here’s what we were running:

Workload Profile:

  • 120 dbt models running nightly (2.5 hours total runtime)
  • 15 streaming jobs processing event data
  • Ad-hoc analytics queries (20–30 per day)
  • ML feature engineering pipelines (weekly batch jobs)

Original Infrastructure:

  • All-purpose clusters: 3 clusters running 8–10 hours/day
  • Job clusters: New cluster per dbt run (cold start every time)
  • Cluster size: Mostly i3.xlarge nodes (4 cores, 30.5 GB RAM)
  • Auto-scaling: Enabled but with wide ranges (2–20 workers)
  • Spot instances: Not used

Monthly Costs (October 2024):

  • Compute: $14,300
  • DBUs (processing): $3,800
  • Storage: $300
  • Total: $18,400

Setting #1: Switched to Photon for SQL-Heavy Workloads

What it is: Photon is Databricks’ vectorized query engine written in C++. It’s faster than standard Spark for SQL queries and often cheaper due to reduced runtime.

What we changed:

# Old cluster configuration
{
  "cluster_name": "dbt_production",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
# New configuration (Photon enabled)
{
  "cluster_name": "dbt_production",
  "spark_version": "13.3.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "enable_elastic_disk": true
}

Results:

  • dbt runtime: 2.5 hours → 1.8 hours (28% faster)
  • Cost per run: $10.80 (14% cheaper despite the higher DBU rate)
  • Why it worked: Our dbt models are 95% SQL with lots of aggregations

The math:

  • Standard DBU rate: $0.40/DBU for jobs compute
  • Photon DBU rate: $0.55/DBU (37.5% more expensive per DBU)
  • But runtime reduction more than offset the higher rate
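A quick back-of-the-envelope version of that trade-off, as a sketch: the DBU burn rate and the exact EC2 line below are illustrative assumptions rather than our meter readings, but with these inputs the arithmetic lands right around the 14% we saw.

# Does a 28% faster runtime offset a 37.5% higher DBU rate?
# Total run cost = hours x (DBUs/hour x $/DBU + EC2 $/hour); all inputs assumed.
def run_cost(hours, dbu_per_hour, dbu_rate, ec2_per_hour):
    return hours * (dbu_per_hour * dbu_rate + ec2_per_hour)

ec2_per_hour = 5 * 0.312   # driver + 4 workers, i3.xlarge on-demand
dbu_per_hour = 4.0         # assumed DBU consumption for this cluster size

standard = run_cost(2.5, dbu_per_hour, 0.40, ec2_per_hour)   # 2.5 h at $0.40/DBU
photon = run_cost(1.8, dbu_per_hour, 0.55, ec2_per_hour)     # 1.8 h at $0.55/DBU

print(f"standard ${standard:.2f}, photon ${photon:.2f}, saved {1 - photon / standard:.0%}")
# -> standard $7.90, photon $6.77, saved 14%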

Monthly savings: $340

When this DOESN’T work:

  • Python-heavy workloads (Photon doesn’t accelerate Python UDFs)
  • Simple data loading tasks (no complex queries to optimize)
  • Workloads already highly optimized

Honest assessment: This was our best win. But we had ideal conditions — lots of SQL aggregations and joins. If your pipelines are mostly Python data processing, skip this.

Setting #2: Enabled Spot Instances (With Careful Fallback)

What it is: Spot instances are spare AWS capacity sold at 50–70% discounts, but AWS can reclaim them with two minutes' notice.

What we changed:

# Cluster policy for batch jobs
{
  "aws_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK"
  },
  "aws_attributes.spot_bid_price_percent": {
    "type": "fixed", 
    "value": 100
  }
}

Important: We only enabled this for:

  • Batch dbt jobs (can retry if interrupted)
  • Non-time-sensitive analytics
  • Development/testing clusters

We explicitly AVOIDED spot for:

  • Real-time streaming jobs
  • Critical reporting pipelines with SLAs
  • Interactive notebooks during business hours

Results:

  • Average spot discount: 62%
  • Interruption rate: 3% of job runs (4 out of 120 monthly jobs)
  • Failed jobs automatically retried on on-demand within 5 minutes
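For reference, the retry side of that setup looks roughly like the following Jobs API task definition. It's a trimmed-down sketch, not our production spec; the task name and cluster sizing are placeholders.

# A fault-tolerant batch task: spot workers with on-demand fallback, plus
# automatic retries so an interruption only delays the run by a few minutes.
job_task = {
    "task_key": "dbt_nightly",             # placeholder name
    "max_retries": 2,                       # rerun automatically if the run fails
    "min_retry_interval_millis": 120_000,   # wait two minutes before retrying
    "retry_on_timeout": False,
    "new_cluster": {
        "spark_version": "13.3.x-photon-scala2.12",
        "node_type_id": "i3.large",
        "num_workers": 6,
        "aws_attributes": {
            "availability": "SPOT_WITH_FALLBACK",  # reclaimed? fall back to on-demand
            "first_on_demand": 1,                  # keep the driver on on-demand capacity
            "spot_bid_price_percent": 100,
        },
    },
}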

The math:

  • Original cost for batch workloads: $8,200/month
  • With spot instances: $3,100/month
  • Savings: $5,100/month

Monthly savings: $5,100

When this DOESN’T work:

  • Jobs that can’t tolerate interruptions
  • Workloads with strict SLAs
  • Instance types with low spot availability (some GPU instances)

Honest assessment: This was our biggest savings by far. But it required discipline — we had to properly separate fault-tolerant workloads from critical ones. The first week we enabled spot everywhere and had production incidents. Learn from our mistake.

Setting #3: Right-Sized Cluster Nodes (Down, Not Up)

What it is: We were using i3.xlarge (4 cores, 30.5 GB RAM) for everything. Most of our workloads didn’t need that much memory.

What we changed:

For dbt and SQL workloads:

From: i3.xlarge (4 cores, 30.5 GB RAM) at $0.312/hour
To: i3.large (2 cores, 15.25 GB RAM) at $0.156/hour
Added: More workers to maintain parallelism

Counter-intuitive result:

  • Old config: 4 workers × $0.312/hour = $1.248/hour
  • New config: 6 workers × $0.156/hour = $0.936/hour
  • 25% cheaper with MORE workers because memory wasn’t our bottleneck

Results:

  • Slightly less raw capacity (16 cores across 4 nodes vs. 12 cores across 6 nodes), but since memory wasn't the bottleneck, the work spread across more, smaller executors just fine
  • Runtime impact: +8% slower (acceptable trade-off)
  • Cost reduction: 25%

The math:

Cluster hours per month: 180 hours
Old cost: 180 × $1.248 = $224.64
New cost: 180 × $0.936 = $168.48
Savings: $56.16 per cluster
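The same comparison in code, if you want to plug in other node types. The hourly prices are the us-east-1 on-demand rates quoted above, and 180 hours/month is our own utilization.

# Monthly worker cost for the two configurations (EC2 only, driver excluded)
def monthly_cost(workers, price_per_hour, hours_per_month=180):
    return workers * price_per_hour * hours_per_month

old = monthly_cost(4, 0.312)   # 4 x i3.xlarge
new = monthly_cost(6, 0.156)   # 6 x i3.large
print(f"old ${old:.2f}/month, new ${new:.2f}/month, saved {1 - new / old:.0%}")
# -> old $224.64/month, new $168.48/month, saved 25%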

We applied this to 4 different job clusters.

Monthly savings: $225

When this DOESN’T work:

  • Memory-intensive operations (large aggregations, caching)
  • ML training workloads
  • Jobs with huge intermediate datasets

Honest assessment: This required actual profiling. We used Databricks’ Ganglia metrics to see that memory utilization was only 40–50%. If you’re already seeing memory pressure, bigger instances might actually be cheaper (faster runtime = less billable hours).

Setting #4: Consolidated Small Jobs into Fewer Clusters

What it is: We had 12 different dbt jobs, each spinning up its own cluster. Cold starts added 3–5 minutes per job.

What we changed:

Before:

# 12 separate jobs, each with own cluster
job_1: models/staging/*
job_2: models/intermediate/*
job_3: models/marts/*
# ... 9 more jobs

After:

# 3 consolidated jobs with proper task dependencies
job_morning:
  - staging models
  - intermediate models (depends on staging)
  - marts models (depends on intermediate)
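In Jobs API terms, the consolidated job is a single payload with three dependent tasks sharing one job cluster. A minimal sketch (cluster settings are placeholders and the dbt commands are omitted):

# One job, one shared cluster, three tasks chained with depends_on instead of
# three separate jobs that each pay a cold start.
consolidated_job = {
    "name": "job_morning",
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {
            "spark_version": "13.3.x-photon-scala2.12",
            "node_type_id": "i3.large",
            "num_workers": 6,
        },
    }],
    "tasks": [
        {"task_key": "staging", "job_cluster_key": "shared"},
        {"task_key": "intermediate", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "staging"}]},
        {"task_key": "marts", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "intermediate"}]},
    ],
}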

Results:

  • Cluster cold starts: 12 per day → 3 per day
  • Time saved on startup: 9 × 4 minutes = 36 minutes daily
  • Compute cost savings: ~$1.40/day (about $43/month; math below)

The math:

Cold start waste: 36 minutes × 30 days = 18 hours/month
Cost per hour (small cluster): $2.40
Savings: 18 × $2.40 = $43.20/month

Monthly savings: $43

Trade-off: Less granular job control. If one model fails, the whole consolidated job is marked failed. We mitigated this by leaning on dbt's default failure handling: a failed model only skips its own downstream models while the rest of the run keeps going, so we simply made sure our job command doesn't pass --fail-fast.

When this DOESN’T work:

  • Jobs with very different cluster requirements
  • Teams that need independent job scheduling
  • Workloads where failures shouldn’t affect other tasks

Honest assessment: Modest savings, but eliminated cluster startup waste. The real benefit was simplifying our Workflows dashboard.

Setting #5: Enabled Auto-Scaling with Tighter Bounds

What it is: Auto-scaling adjusts worker count based on workload. We had it enabled but configured poorly.

What we changed:

Old configuration:

"autoscale": {
  "min_workers": 2,
  "max_workers": 20
}

New configuration (based on actual profiling):

"autoscale": {
  "min_workers": 3,
  "max_workers": 8
}

Why this mattered: Our jobs rarely needed more than 8 workers. The 2–20 range meant:

  • Long scale-up times (adding 18 workers takes time)
  • Occasional over-provisioning (cluster scaled to 20 when 8 would suffice)

Results:

  • Reduced over-provisioning waste
  • Faster scaling (smaller range = quicker decisions)
  • More predictable costs

The math: This was hard to measure precisely, but billing analysis showed:

  • Average worker-hours per job dropped 12%
  • Likely due to reduced over-scaling time

Monthly savings: $180 (estimated)

When this DOESN’T work:

  • Highly variable workloads (some jobs need 2 workers, others need 50)
  • When you haven’t profiled actual resource usage

Honest assessment: You need to actually monitor your jobs for a few weeks to set these ranges correctly. We set ours based on Spark UI metrics showing max concurrent tasks. Don’t just copy our numbers.
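If it helps, the sizing heuristic we leaned on reduces to one line: cap max_workers near the peak task concurrency you actually observe divided by cores per worker. The peak value below is an example, not a recommendation.

import math

peak_concurrent_tasks = 28   # read off the Spark UI during the busiest stage (example value)
cores_per_worker = 4         # e.g. i3.xlarge

max_workers = math.ceil(peak_concurrent_tasks / cores_per_worker)
print(max_workers)   # -> 7; we set the upper bound to 8 for a little headroom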

Setting #6: Switched to Serverless SQL for Ad-Hoc Queries

What it is: Serverless SQL warehouses bill only while queries are actually running; the warehouse starts on demand and suspends when idle, instead of billing for an always-on cluster.

What we changed:

From: Always-on SQL warehouse (2X-Small, running 8 hours/day)

Cost: $0.22/DBU × 1 DBU/hour × 8 hours × 22 workdays = $38.72/month
Plus: Compute costs = ~$120/month
Total: ~$159/month

To: Serverless SQL warehouse (starts on-demand)

Cost per query: ~$0.03-0.15 depending on complexity
Average monthly queries: 450
Average cost: ~$40/month

Results:

  • Old cost: $159/month
  • New cost: $40/month
  • Savings: $119/month

Monthly savings: $119

The catch:

  • Cold start time: 1–2 minutes for first query
  • Not suitable if users need instant query results
  • More expensive for high query volumes (break-even around 800 queries/month)
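The crossover depends almost entirely on your average cost per query, so it's worth checking both ends of your observed range. A rough sketch of the arithmetic, with assumed per-query costs:

# At what monthly query volume does an always-on warehouse become cheaper?
always_on_monthly = 159.0                   # 2X-Small running 8 h/day, from above
for cost_per_query in (0.05, 0.10, 0.20):   # assumed average costs per query
    break_even = always_on_monthly / cost_per_query
    print(f"${cost_per_query:.2f}/query -> break-even near {break_even:.0f} queries/month")

With heavier queries the crossover drops quickly, which is how we ended up treating ~800 queries/month as our rule of thumb.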

When this DOESN’T work:

  • High query volumes (serverless becomes expensive)
  • Dashboards needing sub-second refresh
  • When consistent performance matters more than cost

Honest assessment: Perfect for our use case (moderate ad-hoc queries). But if your analysts are running 100+ queries daily, a right-sized always-on warehouse might be cheaper.

Setting #7: Implemented Aggressive Cluster Termination

What it is: Clusters auto-terminate after inactivity. We tightened this well below the 120-minute default.

What we changed:

Development clusters:

"autotermination_minutes": 30 # Was: 120 Job clusters:

"autotermination_minutes": 10 # Was: 30 The obvious benefit: Less idle cluster time = lower costs

The non-obvious benefit: Forced our team to write more efficient notebooks. When you know your cluster terminates in 30 minutes, you don’t leave expensive queries running and forget about them.

Results: We tracked idle cluster time before/after:

  • Before: ~45 hours/month of idle clusters
  • After: ~8 hours/month

The math:

Saved idle time: 37 hours/month
Average cluster cost: $3.50/hour
Savings: 37 × $3.50 = $129.50/month

Monthly savings: $130

When this DOESN’T work:

  • Interactive development requiring long-running sessions
  • Training/demo environments where quick access matters
  • Jobs with truly long execution times (6+ hours)

Honest assessment: This had a hidden benefit — forced better coding practices. Our engineers started actually profiling queries instead of running things and walking away.

The Complete Impact

Setting                 | Monthly Savings | Implementation Difficulty             | Risk Level
Photon for SQL          | $340            | Low (runtime/config change)           | Low
Spot Instances          | $5,100          | Medium (requires job categorization)  | Medium
Right-Sized Nodes       | $225            | Medium (requires profiling)           | Low
Job Consolidation       | $43             | Low (workflow refactor)               | Low
Tighter Auto-Scaling    | $180            | Low (settings change)                 | Low
Serverless SQL          | $119            | Low (warehouse change)                | Low
Aggressive Termination  | $130            | Low (settings change)                 | Low
Total                   | $6,137          |                                       |

Total calculated savings: $6,137/month, which on paper would bring the $18,400 baseline down to about $12,263, a 33% decrease.

Wait, the math doesn't match?

Our actual bill came in closer to $11,200/month, roughly $7,200 below the baseline and noticeably more than the $6,137 the individual settings account for.

The difference comes from three sources:

  1. Compounding effects: Photon + spot instances together saved more than either alone
  2. Reduced storage: Shorter job runtimes = less temporary data
  3. Behavioral changes: Team became more cost-conscious after seeing results

What We Learned (The Honest Parts)

1. One Week of Profiling Saved Months of Guessing

We spent a week with Databricks’ cost analyzer and Ganglia metrics before changing anything. This was time well spent. Our initial assumptions about what was expensive were wrong.

Surprises:

  • Our “small” streaming jobs were costing 2x more than dbt because they ran 24/7
  • Development clusters left running overnight cost more than all batch jobs combined
  • One analyst’s weekly report consumed 18% of our total compute

2. Spot Instances Require Operational Maturity

The first time a spot interruption killed a production job at 6 AM, we panicked. Then we realized:

  • The job auto-retried on on-demand
  • Total delay was 7 minutes
  • We saved $170 that day

But this required:

  • Proper job retry configuration
  • Monitoring to catch cascading failures
  • Team discipline to not bypass spot policies

3. Team Buy-In Was Essential

We involved the entire data team in cost optimization:

  • Shared weekly cost dashboards
  • Made cost a visible metric in Slack
  • Celebrated optimizations (not just savings, but elegant solutions)

When engineers saw their names next to “$340 saved this month by optimizing cluster config,” they started proactively looking for waste.

4. Not Every Optimization Worked

We tried several things that failed or had minimal impact:

Failed: Switching to ARM-based instances

  • Promised 20% savings
  • Reality: Compatibility issues with some libraries, 5% performance degradation
  • Rolled back after two weeks

Minimal impact: Table optimization

  • Expected big win from OPTIMIZE and VACUUM commands
  • Actual: 3% query speedup, negligible cost change
  • Conclusion: Still worth doing for performance, not a cost saver

Failed: Pooling clusters

  • Thought we could share clusters across teams
  • Reality: Conflicts over cluster configs, security concerns
  • Ended up costing more time than money saved

Implementation Guide: Start Here

If you’re trying to replicate our results:

Week 1: Baseline and Monitor

  1. Enable cost tracking in Databricks account console
  2. Review cluster metrics (Ganglia on older runtimes, the built-in cluster metrics UI on DBR 13+) for all clusters
  3. Run workloads normally, collect data
  4. Identify top 5 cost drivers
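For step 4, the system billing tables get you most of the way there. Here's a sketch of the kind of notebook query we ran; note that system.billing.usage reports DBUs, so join system.billing.list_prices if you want dollar figures.

# Top cost drivers over the last 30 days, grouped by SKU and job
# (`spark` is the session a Databricks notebook provides; usage is in DBUs).
top_drivers = spark.sql("""
    SELECT sku_name,
           usage_metadata.job_id AS job_id,
           SUM(usage_quantity)   AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY sku_name, usage_metadata.job_id
    ORDER BY dbus DESC
    LIMIT 5
""")
top_drivers.show(truncate=False)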

Week 2: Low-Risk Changes

  1. Enable Photon on SQL-heavy clusters (test in staging first)
  2. Reduce auto-termination times by 50%
  3. Right-size one non-critical cluster as a test

Week 3: Medium-Risk Changes

  1. Enable spot instances for one batch job
  2. Consolidate 2–3 small jobs
  3. Test serverless SQL for low-volume warehouses

Week 4: Optimization

  1. Adjust auto-scaling based on week 1–3 metrics
  2. Expand spot usage to more workloads
  3. Fine-tune cluster sizes

Month 2+: Continuous Improvement

  • Weekly cost review meetings
  • Monthly optimization sprints
  • Automated alerts for cost anomalies

The Caveats You Should Know

1. Your mileage will vary. Our workload is SQL-heavy with forgiving SLAs. If you're running real-time ML inference or sub-second dashboards, some optimizations won't apply.

2. Spot savings are regional. We're in us-east-1 where spot availability is good. In regions with less spare capacity, interruption rates could be higher.

3. Team size matters. With 5 data engineers, collaborative cost reduction worked. With 50 engineers, you'd need different governance.

4. These aren't one-time changes. Workloads evolve. We review costs monthly and adjust configurations quarterly.

Cost Monitoring Tools We Use

Built-in Databricks:

  • System Tables for usage tracking
  • Cost Explorer in account console
  • Job run duration trends

External:

  • AWS Cost Explorer (for spot vs on-demand breakdown)
  • Slack bot posting daily spend (keeps it visible)
  • Grafana dashboard showing cost per pipeline
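The Slack bot is nothing fancy: a scheduled notebook sums yesterday's usage and posts it to an incoming webhook. A minimal sketch, with the webhook URL kept in an environment variable or secret and the dollar figure computed elsewhere:

import os
import requests

def post_daily_spend(total_usd: float) -> None:
    """Post yesterday's Databricks spend to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]   # placeholder: store as a secret
    requests.post(
        webhook_url,
        json={"text": f"Databricks spend yesterday: ${total_usd:,.2f}"},
        timeout=10,
    )

# post_daily_spend(total)  # `total` would come from a system.billing.usage query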

Custom alerts:

-- Alert when any single job gets expensive over the trailing 7 days.
-- Note: system.billing.usage reports DBUs; join system.billing.list_prices
-- (or multiply by your SKU rate) if you want the threshold in dollars.
SELECT
  usage_metadata.job_id AS job_id,
  SUM(usage_quantity)   AS total_dbus
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 7)
  AND usage_metadata.job_id IS NOT NULL
GROUP BY usage_metadata.job_id
HAVING SUM(usage_quantity) > 100  -- tune to roughly the $50 mark for your SKUs

Final Thoughts

We cut our Databricks bill by roughly a third in one quarter. But the bigger win was building cost awareness into our engineering culture.

Our team now:

  • Profiles queries before deploying them
  • Questions whether every job needs to run hourly
  • Celebrates efficient code as much as functional code

The specific settings we changed mattered. But the mindset shift — treating compute as a finite resource — mattered more.

Your optimization journey will look different. Your workloads, SLAs, and team dynamics are unique. But the framework applies:

  1. Measure before optimizing
  2. Start with low-risk changes
  3. Involve your team
  4. Monitor continuously
  5. Be honest about failures

And most importantly: document what you did. Your future self (and your team) will thank you.

All cost figures based on AWS us-east-1 pricing as of December 2024. Your costs will vary based on region, instance types, and usage patterns. Settings and configurations tested on Databricks Runtime 13.3 LTS.