Every data pipeline starts on one machine. You write a Python script, it reads a file, transforms the data, and writes the result.
For thousands of rows this works perfectly — the script finishes in seconds, memory usage barely registers, and the disk handles the read-write cycle without complaint. But data grows.
NordGrid Energy, a Northern European grid operator, started with 10,000 smart meters reporting daily energy consumption. Five years later, they manage 500,000 meters.
The daily readings file grew from 2 megabytes to 1.5 gigabytes. The Python script that once finished in 4 seconds now takes 35 minutes.
This lesson is about why that slowdown happens and what it tells us about the limits of a single computer.
A computer has three physical resources that determine how fast it can process data: the disk (where data is stored), memory (where data is held while being worked on), and the CPU (where computations happen). Each of these resources has a maximum throughput — a speed limit that no software optimization can exceed.
When your data grows large enough to hit any one of these limits, your script slows down dramatically. These limits are sometimes called the three walls of single-machine processing, and understanding them is the first step toward understanding why distributed computing exists.
Data on Disk Memory (RAM) CPU
┌──────────────────────┐ ┌────────────────────┐ ┌────────────────────┐
│ 1.5 GB daily file │ │ Must fit in RAM │ │ Processes rows │
│ │ │ to be processed │ │ one at a time │
│ Read speed: │ │ │ │ (per core) │
│ ~500 MB/s (SSD) │ │ Typical server: │ │ │
│ ~100 MB/s (HDD) │ │ 16–64 GB │ │ Typical server: │
│ │ │ │ │ 4–16 cores │
└──────────────────────┘ └────────────────────┘ └────────────────────┘
│ │ │
Wall 1: Wall 2: Wall 3:
Disk I/O Memory Compute
bottleneck capacity throughputThe first wall is usually the disk. Before your script can transform any data, it must read that data from the hard drive or SSD into memory.
A modern SSD reads at roughly 500 megabytes per second. That sounds fast, but NordGrid’s 1.5 GB daily file takes 3 seconds just to load into memory — before any processing begins.
When NordGrid runs a monthly summary that reads 30 days of files, that is 45 GB of data. At 500 MB/s, reading alone takes 90 seconds.
On an older spinning hard drive at 100 MB/s, it takes 7.5 minutes. Writing results back to disk doubles the I/O time.
For a simple read-transform-write pipeline, the disk often accounts for more than half the total runtime.
Think of the disk as a warehouse door. No matter how many workers you have inside the warehouse, goods can only enter and leave through that one door at a fixed rate.
A wider door (a faster SSD) helps, but there is a physical limit to how wide one door can be. When data exceeds what the door can handle in a reasonable time, the only option is to have multiple doors — multiple machines, each with its own disk, reading their own portion of the data simultaneously.
The second wall is memory. When your Python script reads a CSV file into a Pandas DataFrame, the entire dataset must fit in RAM.
A 1.5 GB CSV file typically expands to 3–5 GB in memory because Python objects carry overhead — each string, each number, each row structure occupies more memory than the raw bytes on disk. A server with 16 GB of RAM can hold one day’s NordGrid data comfortably.
But a monthly analysis (45 GB on disk, 90–150 GB in memory) is physically impossible on that machine. The data simply does not fit.
Your script crashes with a MemoryError, and no amount of code optimization can fix a problem that is fundamentally about physical capacity.
You can buy a server with 256 GB or even 1 TB of RAM. But bigger hardware is expensive, has diminishing returns, and eventually hits its own ceiling. A single machine with 1 TB of RAM still cannot process 2 TB of data in memory. And that 1 TB machine costs 10x more than ten 128 GB machines combined. This economic reality is one of the reasons distributed computing exists.
The third wall is the CPU. Even if the data fits in memory and the disk is fast enough, the CPU must process every row.
A single CPU core can typically evaluate a few hundred million simple operations per second. For NordGrid’s 500,000 daily readings, that is plenty — a single core handles the computation in milliseconds.
But consider the annual reprocessing job: 500,000 meters × 365 days = 182.5 million rows. Add a complex transformation — parsing dates, computing rolling averages, joining with a reference table — and each row requires hundreds of operations.
The total compute time stretches to minutes, then tens of minutes, as the data grows.
Modern servers have multiple cores (4, 8, 16, or more), and well-written programs can use them in parallel. But writing correct multi-threaded code in Python is difficult and error-prone, and Python’s Global Interpreter Lock limits true parallelism for many operations.
Even without Python’s limitations, a single server’s core count is fixed. When the compute requirement exceeds what 16 cores can deliver, you need more cores — which means more machines.
Year 1 Year 2 Year 3 Year 4 Year 5
10K meters 50K meters 150K meters 300K meters 500K meters
2 MB/day 15 MB/day 200 MB/day 800 MB/day 1.5 GB/day
█ ██ █████ █████████ ██████████████
4 seconds 20 seconds 3 minutes 12 minutes 35 minutes
Python OK Python OK Python slow Python pain Python limit
─────────────
▲ Something
must changeNordGrid’s story is not unusual. Every organization that collects data at scale eventually hits the same walls.
The data grows gradually — 5% more meters this quarter, a new sensor type next quarter, a regulatory requirement to retain five years of history instead of two. Each increment is manageable on its own.
But the compound effect is exponential growth that eventually overwhelms any single machine. The question is not whether you will hit the walls, but when.
Distributed computing is the answer to that question, and Apache Spark is the tool that makes it practical. In the next lesson, you will see how the simple idea of using multiple machines together transforms this problem completely.
The bottleneck is not Python being slow. Python is slower than Java or C++, but the fundamental limits are physical: disk read speed, memory capacity, and core count. Rewriting NordGrid’s pipeline in a faster language would help, but it would not change the fact that one machine cannot read 45 GB of data faster than its disk allows, or hold 150 GB of data in 16 GB of RAM.