Last quarter, our Databricks bill was $18,400 a month. Today it's closer to $11,200, roughly $7,200 a month less. Same workloads, same data volumes, same team.

This isn’t a miracle story or a complete platform overhaul. We made seven specific configuration changes based on Databricks’ own documentation and cost optimization guides. Some worked better than expected. One barely made a difference. Here’s exactly what we did, with real numbers and honest assessments.

Our Starting Point: The Baseline

Before jumping into optimizations, here’s what we were running:

Workload Profile:

  • 120 dbt models running nightly (2.5 hours total runtime)
  • 15 streaming jobs processing event data
  • Ad-hoc analytics queries (20–30 per day)
  • ML feature engineering pipelines (weekly batch jobs)

Original Infrastructure:

  • All-purpose clusters: 3 clusters running 8–10 hours/day
  • Job clusters: New cluster per dbt run (cold start every time)
  • Cluster size: Mostly i3.xlarge nodes (4 cores, 30.5 GB RAM)
  • Auto-scaling: Enabled but with wide ranges (2–20 workers)
  • Spot instances: Not used

Monthly Costs (October 2024):

  • Compute: $14,300
  • DBUs (processing): $3,800
  • Storage: $300
  • Total: $18,400

Setting #1: Switched to Photon for SQL-Heavy Workloads

What it is: Photon is Databricks’ vectorized query engine written in C++. It’s faster than standard Spark for SQL queries and often cheaper due to reduced runtime.

What we changed:

# Old cluster configuration
{
  "cluster_name": "dbt_production",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
# New configuration (Photon enabled)
{
  "cluster_name": "dbt_production",
  "spark_version": "13.3.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "enable_elastic_disk": true
}

Results:

  • dbt runtime: 2.5 hours → 1.8 hours (28% faster)
  • Cost per run: $10.80 (14% cheaper despite the higher DBU rate)
  • Why it worked: Our dbt models are 95% SQL with lots of aggregations

The math:

  • Standard DBU rate: $0.40/DBU for jobs compute
  • Photon DBU rate: $0.55/DBU (37.5% more expensive per DBU)
  • But runtime reduction more than offset the higher rate
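A quick back-of-the-envelope version of that trade-off, as a sketch: the DBU burn rate and the exact EC2 line below are illustrative assumptions rather than our meter readings, but with these inputs the arithmetic lands right around the 14% we saw.

# Does a 28% faster runtime offset a 37.5% higher DBU rate?
# Total run cost = hours x (DBUs/hour x $/DBU + EC2 $/hour); all inputs assumed.
def run_cost(hours, dbu_per_hour, dbu_rate, ec2_per_hour):
    return hours * (dbu_per_hour * dbu_rate + ec2_per_hour)

ec2_per_hour = 5 * 0.312   # driver + 4 workers, i3.xlarge on-demand
dbu_per_hour = 4.0         # assumed DBU consumption for this cluster size

standard = run_cost(2.5, dbu_per_hour, 0.40, ec2_per_hour)   # 2.5 h at $0.40/DBU
photon = run_cost(1.8, dbu_per_hour, 0.55, ec2_per_hour)     # 1.8 h at $0.55/DBU

print(f"standard ${standard:.2f}, photon ${photon:.2f}, saved {1 - photon / standard:.0%}")
# -> standard $7.90, photon $6.77, saved 14%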

Monthly savings: $340

When this DOESN’T work:

  • Python-heavy workloads (Photon doesn’t accelerate Python UDFs)
  • Simple data loading tasks (no complex queries to optimize)
  • Workloads already highly optimized

Honest assessment: This was our best win. But we had ideal conditions — lots of SQL aggregations and joins. If your pipelines are mostly Python data processing, skip this.

Setting #2: Enabled Spot Instances (With Careful Fallback)

What it is: Spot instances are spare AWS capacity sold at 50–70% discounts, but AWS can reclaim them with two minutes' notice.

What we changed:

# Cluster policy for batch jobs
{
  "aws_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK"
  },
  "aws_attributes.spot_bid_price_percent": {
    "type": "fixed", 
    "value": 100
  }
}

Important: We only enabled this for:

  • Batch dbt jobs (can retry if interrupted)
  • Non-time-sensitive analytics
  • Development/testing clusters

We explicitly AVOIDED spot for:

  • Real-time streaming jobs
  • Critical reporting pipelines with SLAs
  • Interactive notebooks during business hours

Results:

  • Average spot discount: 62%
  • Interruption rate: 3% of job runs (4 out of 120 monthly jobs)
  • Failed jobs automatically retried on on-demand within 5 minutes
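For reference, the retry side of that setup looks roughly like the following Jobs API task definition. It's a trimmed-down sketch, not our production spec; the task name and cluster sizing are placeholders.

# A fault-tolerant batch task: spot workers with on-demand fallback, plus
# automatic retries so an interruption only delays the run by a few minutes.
job_task = {
    "task_key": "dbt_nightly",             # placeholder name
    "max_retries": 2,                       # rerun automatically if the run fails
    "min_retry_interval_millis": 120_000,   # wait two minutes before retrying
    "retry_on_timeout": False,
    "new_cluster": {
        "spark_version": "13.3.x-photon-scala2.12",
        "node_type_id": "i3.large",
        "num_workers": 6,
        "aws_attributes": {
            "availability": "SPOT_WITH_FALLBACK",  # reclaimed? fall back to on-demand
            "first_on_demand": 1,                  # keep the driver on on-demand capacity
            "spot_bid_price_percent": 100,
        },
    },
}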

The math:

  • Original cost for batch workloads: $8,200/month
  • With spot instances: $3,100/month
  • Savings: $5,100/month

Monthly savings: $5,100

When this DOESN’T work:

  • Jobs that can’t tolerate interruptions
  • Workloads with strict SLAs
  • Instance types with low spot availability (some GPU instances)

Honest assessment: This was our biggest savings by far. But it required discipline — we had to properly separate fault-tolerant workloads from critical ones. The first week we enabled spot everywhere and had production incidents. Learn from our mistake.

Setting #3: Right-Sized Cluster Nodes (Down, Not Up)

What it is: We were using i3.xlarge (4 cores, 30.5 GB RAM) for everything. Most of our workloads didn’t need that much memory.

What we changed:

For dbt and SQL workloads:

From: i3.xlarge (4 cores, 30.5 GB RAM) at $0.312/hour
To: i3.large (2 cores, 15.25 GB RAM) at $0.156/hour
Added: More workers to maintain parallelism

Counter-intuitive result:

  • Old config: 4 workers × $0.312/hour = $1.248/hour
  • New config: 6 workers × $0.156/hour = $0.936/hour
  • 25% cheaper with MORE workers because memory wasn’t our bottleneck

Results:

  • Slightly less raw capacity (16 cores across 4 nodes vs. 12 cores across 6 nodes), but since memory wasn't the bottleneck, the work spread across more, smaller executors just fine
  • Runtime impact: +8% slower (acceptable trade-off)
  • Cost reduction: 25%

The math:

Cluster hours per month: 180 hours
Old cost: 180 × $1.248 = $224.64
New cost: 180 × $0.936 = $168.48
Savings: $56.16 per cluster
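The same comparison in code, if you want to plug in other node types. The hourly prices are the us-east-1 on-demand rates quoted above, and 180 hours/month is our own utilization.

# Monthly worker cost for the two configurations (EC2 only, driver excluded)
def monthly_cost(workers, price_per_hour, hours_per_month=180):
    return workers * price_per_hour * hours_per_month

old = monthly_cost(4, 0.312)   # 4 x i3.xlarge
new = monthly_cost(6, 0.156)   # 6 x i3.large
print(f"old ${old:.2f}/month, new ${new:.2f}/month, saved {1 - new / old:.0%}")
# -> old $224.64/month, new $168.48/month, saved 25%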

We applied this to 4 different job clusters.

Monthly savings: $225

When this DOESN’T work:

  • Memory-intensive operations (large aggregations, caching)
  • ML training workloads
  • Jobs with huge intermediate datasets

Honest assessment: This required actual profiling. We used Databricks’ Ganglia metrics to see that memory utilization was only 40–50%. If you’re already seeing memory pressure, bigger instances might actually be cheaper (faster runtime = less billable hours).

Setting #4: Consolidated Small Jobs into Fewer Clusters

What it is: We had 12 different dbt jobs, each spinning up its own cluster. Cold starts added 3–5 minutes per job.

What we changed:

Before:

# 12 separate jobs, each with own cluster
job_1: models/staging/*
job_2: models/intermediate/*
job_3: models/marts/*
# ... 9 more jobs

After:

# 3 consolidated jobs with proper task dependencies
job_morning:
  - staging models
  - intermediate models (depends on staging)
  - marts models (depends on intermediate)
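In Jobs API terms, the consolidated job is a single payload with three dependent tasks sharing one job cluster. A minimal sketch (cluster settings are placeholders and the dbt commands are omitted):

# One job, one shared cluster, three tasks chained with depends_on instead of
# three separate jobs that each pay a cold start.
consolidated_job = {
    "name": "job_morning",
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {
            "spark_version": "13.3.x-photon-scala2.12",
            "node_type_id": "i3.large",
            "num_workers": 6,
        },
    }],
    "tasks": [
        {"task_key": "staging", "job_cluster_key": "shared"},
        {"task_key": "intermediate", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "staging"}]},
        {"task_key": "marts", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "intermediate"}]},
    ],
}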

Results:

  • Cluster cold starts: 12 per day → 3 per day
  • Time saved on startup: 9 × 4 minutes = 36 minutes daily
  • Compute cost savings: ~$1.40/day (about $43/month; math below)

The math:

Cold start waste: 36 minutes × 30 days = 18 hours/month
Cost per hour (small cluster): $2.40
Savings: 18 × $2.40 = $43.20/month

Monthly savings: $43

Trade-off: Less granular job control. If one model fails, the whole consolidated job is marked failed. We mitigated this by leaning on dbt's default failure handling: a failed model only skips its own downstream models while the rest of the run keeps going, so we simply made sure our job command doesn't pass --fail-fast.

When this DOESN’T work:

  • Jobs with very different cluster requirements
  • Teams that need independent job scheduling
  • Workloads where failures shouldn’t affect other tasks

Honest assessment: Modest savings, but eliminated cluster startup waste. The real benefit was simplifying our Workflows dashboard.

Setting #5: Enabled Auto-Scaling with Tighter Bounds

What it is: Auto-scaling adjusts worker count based on workload. We had it enabled but configured poorly.

What we changed:

Old configuration:

"autoscale": {
  "min_workers": 2,
  "max_workers": 20
}

New configuration (based on actual profiling):

"autoscale": {
  "min_workers": 3,
  "max_workers": 8
}

Why this mattered: Our jobs rarely needed more than 8 workers. The 2–20 range meant:

  • Long scale-up times (adding 18 workers takes time)
  • Occasional over-provisioning (cluster scaled to 20 when 8 would suffice)

Results:

  • Reduced over-provisioning waste
  • Faster scaling (smaller range = quicker decisions)
  • More predictable costs

The math: This was hard to measure precisely, but billing analysis showed:

  • Average worker-hours per job dropped 12%
  • Likely due to reduced over-scaling time

Monthly savings: $180 (estimated)

When this DOESN’T work:

  • Highly variable workloads (some jobs need 2 workers, others need 50)
  • When you haven’t profiled actual resource usage

Honest assessment: You need to actually monitor your jobs for a few weeks to set these ranges correctly. We set ours based on Spark UI metrics showing max concurrent tasks. Don’t just copy our numbers.
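If it helps, the sizing heuristic we leaned on reduces to one line: cap max_workers near the peak task concurrency you actually observe divided by cores per worker. The peak value below is an example, not a recommendation.

import math

peak_concurrent_tasks = 28   # read off the Spark UI during the busiest stage (example value)
cores_per_worker = 4         # e.g. i3.xlarge

max_workers = math.ceil(peak_concurrent_tasks / cores_per_worker)
print(max_workers)   # -> 7; we set the upper bound to 8 for a little headroom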

Setting #6: Switched to Serverless SQL for Ad-Hoc Queries

What it is: Serverless SQL warehouses bill only while queries are actually running; the warehouse starts on demand and suspends when idle, instead of billing for an always-on cluster.

What we changed:

From: Always-on SQL warehouse (2X-Small, running 8 hours/day)

Cost: $0.22/DBU × 1 DBU/hour × 8 hours × 22 workdays = $38.72/month
Plus: Compute costs = ~$120/month
Total: ~$159/month

To: Serverless SQL warehouse (starts on-demand)

Cost per query: ~$0.03-0.15 depending on complexity
Average monthly queries: 450
Average cost: ~$40/month

Results:

  • Old cost: $159/month
  • New cost: $40/month
  • Savings: $119/month

Monthly savings: $119

The catch:

  • Cold start time: 1–2 minutes for first query
  • Not suitable if users need instant query results
  • More expensive for high query volumes (break-even around 800 queries/month)
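The crossover depends almost entirely on your average cost per query, so it's worth checking both ends of your observed range. A rough sketch of the arithmetic, with assumed per-query costs:

# At what monthly query volume does an always-on warehouse become cheaper?
always_on_monthly = 159.0                   # 2X-Small running 8 h/day, from above
for cost_per_query in (0.05, 0.10, 0.20):   # assumed average costs per query
    break_even = always_on_monthly / cost_per_query
    print(f"${cost_per_query:.2f}/query -> break-even near {break_even:.0f} queries/month")

With heavier queries the crossover drops quickly, which is how we ended up treating ~800 queries/month as our rule of thumb.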

When this DOESN’T work:

  • High query volumes (serverless becomes expensive)
  • Dashboards needing sub-second refresh
  • When consistent performance matters more than cost

Honest assessment: Perfect for our use case (moderate ad-hoc queries). But if your analysts are running 100+ queries daily, a right-sized always-on warehouse might be cheaper.

Setting #7: Implemented Aggressive Cluster Termination

What it is: Clusters auto-terminate after inactivity. We tightened this well below the 120-minute default.

What we changed:

Development clusters:

"autotermination_minutes": 30 # Was: 120 Job clusters:

"autotermination_minutes": 10 # Was: 30 The obvious benefit: Less idle cluster time = lower costs

The non-obvious benefit: Forced our team to write more efficient notebooks. When you know your cluster terminates in 30 minutes, you don’t leave expensive queries running and forget about them.

Results: We tracked idle cluster time before/after:

  • Before: ~45 hours/month of idle clusters
  • After: ~8 hours/month

The math:

Saved idle time: 37 hours/month
Average cluster cost: $3.50/hour
Savings: 37 × $3.50 = $129.50/month

Monthly savings: $130

When this DOESN’T work:

  • Interactive development requiring long-running sessions
  • Training/demo environments where quick access matters
  • Jobs with truly long execution times (6+ hours)

Honest assessment: This had a hidden benefit — forced better coding practices. Our engineers started actually profiling queries instead of running things and walking away.

The Complete Impact

Setting                 | Monthly Savings | Implementation Difficulty             | Risk Level
Photon for SQL          | $340            | Low (runtime/config change)           | Low
Spot Instances          | $5,100          | Medium (requires job categorization)  | Medium
Right-Sized Nodes       | $225            | Medium (requires profiling)           | Low
Job Consolidation       | $43             | Low (workflow refactor)               | Low
Tighter Auto-Scaling    | $180            | Low (settings change)                 | Low
Serverless SQL          | $119            | Low (warehouse change)                | Low
Aggressive Termination  | $130            | Low (settings change)                 | Low
Total                   | $6,137          |                                       |

Total calculated savings: $6,137/month, which on paper would bring the $18,400 baseline down to about $12,263, a 33% decrease.

Wait, the math doesn't match?

Our actual bill came in closer to $11,200/month, roughly $7,200 below the baseline and noticeably more than the $6,137 the individual settings account for.

The difference comes from three sources:

  1. Compounding effects: Photon + spot instances together saved more than either alone
  2. Reduced storage: Shorter job runtimes = less temporary data
  3. Behavioral changes: Team became more cost-conscious after seeing results

What We Learned (The Honest Parts)

1. One Week of Profiling Saved Months of Guessing

We spent a week with Databricks’ cost analyzer and Ganglia metrics before changing anything. This was time well spent. Our initial assumptions about what was expensive were wrong.

Surprises:

  • Our “small” streaming jobs were costing 2x more than dbt because they ran 24/7
  • Development clusters left running overnight cost more than all batch jobs combined
  • One analyst’s weekly report consumed 18% of our total compute

2. Spot Instances Require Operational Maturity

The first time a spot interruption killed a production job at 6 AM, we panicked. Then we realized:

  • The job auto-retried on on-demand
  • Total delay was 7 minutes
  • We saved $170 that day

But this required:

  • Proper job retry configuration
  • Monitoring to catch cascading failures
  • Team discipline to not bypass spot policies

3. Team Buy-In Was Essential

We involved the entire data team in cost optimization:

  • Shared weekly cost dashboards
  • Made cost a visible metric in Slack
  • Celebrated optimizations (not just savings, but elegant solutions)

When engineers saw their names next to “$340 saved this month by optimizing cluster config,” they started proactively looking for waste.

4. Not Every Optimization Worked

We tried several things that failed or had minimal impact:

Failed: Switching to ARM-based instances

  • Promised 20% savings
  • Reality: Compatibility issues with some libraries, 5% performance degradation
  • Rolled back after two weeks

Minimal impact: Table optimization

  • Expected big win from OPTIMIZE and VACUUM commands
  • Actual: 3% query speedup, negligible cost change
  • Conclusion: Still worth doing for performance, not a cost saver

Failed: Pooling clusters

  • Thought we could share clusters across teams
  • Reality: Conflicts over cluster configs, security concerns
  • Ended up costing more time than money saved

Implementation Guide: Start Here

If you’re trying to replicate our results:

Week 1: Baseline and Monitor

  1. Enable cost tracking in Databricks account console
  2. Review cluster metrics (Ganglia on older runtimes, the built-in cluster metrics UI on DBR 13+) for all clusters
  3. Run workloads normally, collect data
  4. Identify top 5 cost drivers
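For step 4, the system billing tables get you most of the way there. Here's a sketch of the kind of notebook query we ran; note that system.billing.usage reports DBUs, so join system.billing.list_prices if you want dollar figures.

# Top cost drivers over the last 30 days, grouped by SKU and job
# (`spark` is the session a Databricks notebook provides; usage is in DBUs).
top_drivers = spark.sql("""
    SELECT sku_name,
           usage_metadata.job_id AS job_id,
           SUM(usage_quantity)   AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY sku_name, usage_metadata.job_id
    ORDER BY dbus DESC
    LIMIT 5
""")
top_drivers.show(truncate=False)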

Week 2: Low-Risk Changes

  1. Enable Photon on SQL-heavy clusters (test in staging first)
  2. Reduce auto-termination times by 50%
  3. Right-size one non-critical cluster as a test

Week 3: Medium-Risk Changes

  1. Enable spot instances for one batch job
  2. Consolidate 2–3 small jobs
  3. Test serverless SQL for low-volume warehouses

Week 4: Optimization

  1. Adjust auto-scaling based on week 1–3 metrics
  2. Expand spot usage to more workloads
  3. Fine-tune cluster sizes

Month 2+: Continuous Improvement

  • Weekly cost review meetings
  • Monthly optimization sprints
  • Automated alerts for cost anomalies

The Caveats You Should Know

1. Your mileage will vary. Our workload is SQL-heavy with forgiving SLAs. If you're running real-time ML inference or sub-second dashboards, some optimizations won't apply.

2. Spot savings are regional. We're in us-east-1 where spot availability is good. In regions with less spare capacity, interruption rates could be higher.

3. Team size matters. With 5 data engineers, collaborative cost reduction worked. With 50 engineers, you'd need different governance.

4. These aren't one-time changes. Workloads evolve. We review costs monthly and adjust configurations quarterly.

Cost Monitoring Tools We Use

Built-in Databricks:

  • System Tables for usage tracking
  • Cost Explorer in account console
  • Job run duration trends

External:

  • AWS Cost Explorer (for spot vs on-demand breakdown)
  • Slack bot posting daily spend (keeps it visible)
  • Grafana dashboard showing cost per pipeline
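The Slack bot is nothing fancy: a scheduled notebook sums yesterday's usage and posts it to an incoming webhook. A minimal sketch, with the webhook URL kept in an environment variable or secret and the dollar figure computed elsewhere:

import os
import requests

def post_daily_spend(total_usd: float) -> None:
    """Post yesterday's Databricks spend to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]   # placeholder: store as a secret
    requests.post(
        webhook_url,
        json={"text": f"Databricks spend yesterday: ${total_usd:,.2f}"},
        timeout=10,
    )

# post_daily_spend(total)  # `total` would come from a system.billing.usage query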

Custom alerts:

-- Alert when any single job gets expensive over the trailing 7 days.
-- Note: system.billing.usage reports DBUs; join system.billing.list_prices
-- (or multiply by your SKU rate) if you want the threshold in dollars.
SELECT
  usage_metadata.job_id AS job_id,
  SUM(usage_quantity)   AS total_dbus
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 7)
  AND usage_metadata.job_id IS NOT NULL
GROUP BY usage_metadata.job_id
HAVING SUM(usage_quantity) > 100  -- tune to roughly the $50 mark for your SKUs

Final Thoughts

We cut our Databricks bill by roughly a third in one quarter. But the bigger win was building cost awareness into our engineering culture.

Our team now:

  • Profiles queries before deploying them
  • Questions whether every job needs to run hourly
  • Celebrates efficient code as much as functional code

The specific settings we changed mattered. But the mindset shift — treating compute as a finite resource — mattered more.

Your optimization journey will look different. Your workloads, SLAs, and team dynamics are unique. But the framework applies:

  1. Measure before optimizing
  2. Start with low-risk changes
  3. Involve your team
  4. Monitor continuously
  5. Be honest about failures

And most importantly: document what you did. Your future self (and your team) will thank you.

All cost figures based on AWS us-east-1 pricing as of December 2024. Your costs will vary based on region, instance types, and usage patterns. Settings and configurations tested on Databricks Runtime 13.3 LTS.