I spent 8 hours understanding Apache Spark’s memory management

Mar 12, 2026 · 4 min read

I spent 8 hours understanding Apache Spark’s memory management | by Vu Trinh | Feb, 2026 | Medium

Here’s everything you need to know

Intro

In 2009, UC Berkeley’s AMPLab developed Spark.

At the time, MapReduce was the go-to choice for processing massive datasets across multiple machines. AMPLab saw that cluster computing had significant potential.

However, MapReduce made building large applications inefficient, especially for machine learning (ML) tasks requiring multiple data passes.

For example, an ML algorithm might need to make many passes over the same data. With MapReduce, each pass must be written as a separate job and launched individually on the cluster.

So they created Spark. Unlike MapReduce, which writes data to disk after every task, Spark relies on in-memory processing.
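To make the multi-pass cost concrete, here is a toy sketch in plain Python. It is not Spark or MapReduce code: the `Disk` class, `mean_step`, and both runner functions are hypothetical names invented for illustration, and the model only counts input reads (it ignores the intermediate writes a real MapReduce job would also perform).

```python
# Toy model: count how often an iterative job reads its input from "disk".
# Everything here is hypothetical; no Spark or Hadoop API is used.

class Disk:
    """Stands in for distributed storage and counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read_all(self):
        self.reads += 1
        return list(self._records)


def mean_step(points, guess, rate=0.5):
    """One pass over the data: nudge the guess toward the mean."""
    gradient = sum(guess - p for p in points) / len(points)
    return guess - rate * gradient


def run_separate_jobs(disk, passes):
    """MapReduce-like: each pass is its own job and reloads the input."""
    guess = 0.0
    for _ in range(passes):
        points = disk.read_all()  # every job starts from storage
        guess = mean_step(points, guess)
    return guess


def run_in_memory(disk, passes):
    """Spark-like: load once, then reuse the in-memory copy."""
    points = disk.read_all()  # single load
    guess = 0.0
    for _ in range(passes):
        guess = mean_step(points, guess)
    return guess


data = [1.0, 2.0, 3.0, 4.0]
disk_a, disk_b = Disk(data), Disk(data)
result_a = run_separate_jobs(disk_a, passes=10)
result_b = run_in_memory(disk_b, passes=10)
print(disk_a.reads, disk_b.reads)  # prints: 10 1
```

Both runs compute the same result; the only difference in this sketch is how many times the input is fetched from storage, which is the overhead the in-memory approach is meant to avoid.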

With a friendlier API, support for a wide range of use cases, and efficient in-memory processing, Spark has gained increasing attention and become the dominant solution in data processing.

But do you know how Spark manages memory?

This week, I will try to answer that question. We will revisit some Spark basics before diving into Spark’s memory management.

A Spark Application
