What is stablegrid.io?

An ed-tech platform for working data engineers and analysts who'd rather understand a query plan than collect another certificate. Five tracks — PySpark, Microsoft Fabric, Apache Airflow, SQL, and Python — taught Junior to Senior on one fictional power-grid scenario. Every track pairs deep theory with server-graded practice.

Who is it for, and do I need prior experience?

Working analysts, junior engineers, and self-taught learners moving toward data engineering. The Junior tier assumes only basic Python or SQL — we teach the systems part. Mid and Senior assume you've shipped real pipelines and want the architectural depth most courses skip.

How is practice graded?

Server-side. Practice tasks check your code against real assertions, not exact-match strings — you can write the transform any way you like as long as the output matches. When you miss, the platform deep-links you straight back to the lesson that taught the missing piece.
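
To make "real assertions, not exact-match strings" concrete, here is a minimal sketch of an order-insensitive output check. It is not the platform's actual grader: the `rows_match` helper and the sample rows are hypothetical, and the only assumption is that a task's output can be represented as a list of row dictionaries.

```python
from collections import Counter

def rows_match(actual, expected):
    """Return True when both row lists hold the same rows, in any order.

    Each row is a dict; rows are compared as multisets so duplicates count.
    """
    def freeze(row):
        # Sort the items so column order inside a row does not matter.
        return tuple(sorted(row.items()))
    return Counter(map(freeze, actual)) == Counter(map(freeze, expected))

# Hypothetical expected output for a task.
expected = [
    {"region": "NORTH_COAST", "total_kwh": 120},
    {"region": "SOUTH_BAY", "total_kwh": 80},
]
# A learner's result: same rows, different row and column order.
submitted = [
    {"total_kwh": 80, "region": "SOUTH_BAY"},
    {"region": "NORTH_COAST", "total_kwh": 120},
]
print(rows_match(submitted, expected))
```

A check like this passes any transform that produces the right rows, however the learner wrote it, and fails when a row is missing or altered.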

Will I get a certificate?

No. The output is fluency you can demonstrate in a query plan or a code review, not a paper credential. If you're learning to pass an exam, this is the wrong platform. If you're learning to ship better data systems, it's built for that.

Free during beta: every tier, every track, no card needed. If you want to back the build once we launch, the Supporter plan is a one-time €14.99 payment for lifetime access, with no subscription and no renewals.

stablegrid.io

Handle big data with ease — learn PySpark.


§ 01 The curriculum

One lesson, then thirty more.

Module 2 of 10 · The SparkSession — Your Entry Point

What the SparkSession Does

from pyspark.sql import SparkSession

# Create the single entry point to Spark
spark = SparkSession.builder \
    .appName('nordgrid-daily-pipeline') \
    .getOrCreate()

# Everything flows through the spark object
df = spark.read.csv('bronze/meters/')       # read data
df.createOrReplaceTempView('meters')        # register for SQL
result = spark.sql('SELECT * FROM meters')  # run SQL
spark.conf.get('spark.sql.shuffle.partitions')  # read config

print(type(spark))  # <class 'pyspark.sql.session.SparkSession'>

Thirty modules. Junior to Senior.

300 LESSONS · WRITTEN IN FULL

01 · Junior-Level Track

Core Foundation

01 Why Spark Exists
Understand why distributed computing is necessary, how Spark solves the single-machine bottleneck, and how the NordGrid Energy scenario maps to a real-world data engineering context.
02 The SparkSession — Your Entry Point
Understand the SparkSession as the single entry point to all Spark functionality, learn how to create and configure it, and gain familiarity with the Spark UI as a diagnostic tool.
03 DataFrames — The Core Abstraction
Understand DataFrames as Spark’s primary distributed data structure, master creation and schema patterns, and learn the immutability and lazy evaluation models that make distributed processing safe and fast.
04 Selecting and Filtering Data
Master the core data shaping operations: choosing which columns to keep, which rows to keep, and how to chain these operations into clean transformation pipelines.
05 Adding and Transforming Columns
Learn to create new columns, modify existing values, and apply the full range of column-level transformations that turn raw data into analytical assets.
06 Aggregations and Grouping
Understand how Spark computes summaries across distributed data, master groupBy and agg patterns, and learn how aggregation triggers the shuffle that moves data between executors.
07 Joins — Combining DataFrames
Understand every join type Spark supports, learn to choose the right join for each data engineering scenario, and handle the column ambiguity problem that joins create.
08 Reading and Writing Data
Master the DataFrameReader and DataFrameWriter APIs, understand file format characteristics, and build complete read-transform-write pipelines following the NordGrid Bronze-Silver-Gold pattern.
09 Built-in Functions and Expressions
Gain working knowledge of Spark’s built-in function library, learn to compose functions into complex expressions, and understand when to use a built-in function versus a user-defined function.
10 Spark SQL — The SQL Interface
Learn to use SQL alongside the DataFrame API, understand when SQL is the clearer choice, and build the hybrid SQL-plus-DataFrame pipelines that most production teams use.
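
Module 06 above notes that aggregation triggers a shuffle. The toy, single-process model below sketches the idea under simplifying assumptions: partitions are plain Python lists, each "executor" pre-aggregates its own partition, and a hash of the key decides which reducer receives each partial sum. The function names and meter readings are hypothetical, and Spark's real implementation is far more involved.

```python
from collections import defaultdict

def partial_sums(partition):
    """Map side: pre-aggregate one partition locally before any data moves."""
    sums = defaultdict(float)
    for region, kwh in partition:
        sums[region] += kwh
    return dict(sums)

def shuffle(partials, num_reducers):
    """Route each key's partial sum to the reducer chosen by its hash."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for region, subtotal in partial.items():
            reducers[hash(region) % num_reducers][region].append(subtotal)
    return reducers

def grouped_sum(partitions, num_reducers=2):
    """Reduce side: combine the partial sums per key after the shuffle."""
    partials = [partial_sums(p) for p in partitions]
    result = {}
    for reducer in shuffle(partials, num_reducers):
        for region, subtotals in reducer.items():
            result[region] = sum(subtotals)
    return result

# Two partitions of (region, kWh) readings, as if held by two executors.
partitions = [
    [("NORTH_COAST", 10.0), ("SOUTH_BAY", 5.0)],
    [("NORTH_COAST", 7.5), ("SOUTH_BAY", 2.5), ("NORTH_COAST", 2.5)],
]
totals = grouped_sum(partitions)
print(totals)
```

The point of the model is that every row for a given key must end up in one place before the final sum, which is why a grouped aggregation moves data between executors.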

02 · Mid-Level Track

Advanced Systems

01 Execution Plans and the Catalyst Optimizer
Read and interpret physical execution plans, understand Catalyst optimization rules, and use plan analysis to verify that Spark is executing your pipeline the way you intend.
02 Partitioning Strategy
Master partition count tuning, understand the trade-offs between repartition and coalesce, and design partitioning schemes that balance read performance, write performance, and storage efficiency.
03 Join Performance and Shuffle Optimization
Master the performance characteristics of distributed joins — broadcast strategies, shuffle mechanics, skew detection, and the optimizer behaviors that determine whether a join takes seconds or hours.
04 Schema Evolution and Delta Lake
Understand schema evolution as a production reality, master Delta Lake’s ACID transactions and MERGE operations, and build pipelines that handle schema changes without downtime.
05 Pipeline Design Patterns
Design idempotent, incremental, and fault-tolerant pipelines that handle real-world production concerns: late-arriving data, backfills, slowly changing dimensions, and multi-stage error recovery.
06 Testing PySpark Pipelines
Build a comprehensive testing strategy for PySpark code: unit tests for transformations, integration tests for pipelines, data validation tests for output quality, and the test infrastructure that makes testing practical.
07 Deployment and CI/CD
Automate the path from code change to production deployment: spark-submit automation, Airflow integration patterns, environment management, and CI/CD pipelines that validate PySpark code before it reaches production.
08 Memory Management and Caching
Understand Spark’s memory architecture, diagnose out-of-memory errors, design effective caching strategies, and configure memory settings that balance performance and stability.
09 Performance Monitoring and Tuning
Build a systematic approach to performance monitoring: instrument pipelines with metrics, interpret Spark UI data, diagnose common performance problems, and apply targeted tuning techniques.
10 Structured Streaming Foundations
Understand Spark’s unified batch-streaming model, build micro-batch streaming pipelines for near-real-time data processing, and handle the production concerns of triggers, watermarks, and output modes.
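
Module 05 above asks for idempotent pipelines that survive retries, backfills, and late-arriving data. One way to picture idempotency is a keyed upsert: replaying the same batch must leave the table unchanged. The sketch below models a table as a plain dictionary. The `merge_batch` helper, the `meter_id` key, and the sample rows are hypothetical and only illustrate the property, not a particular storage engine.

```python
def merge_batch(table, batch, key="meter_id"):
    """Upsert batch rows into table by key and return a new table dict.

    Matching keys are overwritten and new keys are inserted, so replaying
    the same batch leaves the table unchanged.
    """
    merged = dict(table)  # copy so the input table is not mutated
    for row in batch:
        merged[row[key]] = row
    return merged

silver = {"m1": {"meter_id": "m1", "kwh": 10}}
batch = [
    {"meter_id": "m1", "kwh": 12},  # correction for an existing meter
    {"meter_id": "m2", "kwh": 7},   # late-arriving new meter
]
once = merge_batch(silver, batch)
twice = merge_batch(once, batch)  # simulate a retry or a backfill
print(once == twice)
```

An append-only write would fail this check, because the second run would duplicate both rows.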

03 · Senior-Level Track

Platform Architecture

01 Platform Architecture at Scale
Design a multi-team data platform on Spark — catalog structure, namespace conventions, table ownership, layer contracts, and the organizational patterns that prevent a 50-pipeline platform from collapsing under its own weight.
02 Advanced Catalyst and AQE Internals
Understand the Catalyst optimizer and Adaptive Query Execution at the implementation level — rule ordering, cost models, statistics propagation, and the runtime adaptation mechanisms that determine how Spark actually executes your query.
03 Partition Strategy at Scale
Design, implement, and maintain partition strategies for multi-terabyte datasets — covering column selection, granularity trade-offs, dynamic partition pruning, small files remediation, and the operational burden of repartitioning a live production pipeline.
04 Advanced Join Architecture
Design join strategies for multi-terabyte pipelines with complex topologies — bucketed joins, range joins, join elimination, semi-join pushdown, and the architectural decisions that determine whether a multi-hop join pipeline takes 10 minutes or 10 hours.
05 Schema Registries and Data Contracts
Implement cross-team schema governance at scale — schema registries, producer-consumer contracts, breaking change negotiation, contract testing across team boundaries, and the organizational processes that prevent schema changes from breaking downstream pipelines.
06 Spark Internals — The Execution Engine
Understand Spark's execution engine at the implementation level — the DAGScheduler, the TaskScheduler, speculative execution, shuffle mechanics, and the resource negotiation that determines how your application shares the cluster with others.
07 Spark Internals — Storage and I/O
Understand Spark's storage layer at the implementation level — Parquet internals, columnar encoding, predicate evaluation, the DataSource V2 API, and the emerging table format landscape that determines how data is physically stored and accessed.
08 Multi-Tenant Operations and Security
Design and operate a secure, multi-tenant Spark platform — Unity Catalog deep dive, row-level and column-level security, audit logging, data masking, and the compliance frameworks that govern data access in regulated industries.
09 Advanced Streaming Architecture
Design production streaming architectures beyond the mid-level foundations — stream-stream joins, Kafka integration, state store management, exactly-once delivery, and the architectural patterns that make streaming reliable at enterprise scale.
10 Enterprise Governance and Capacity Planning
Establish the organizational and operational frameworks that sustain a multi-team Spark platform over years — data lineage, capacity planning, platform SLA governance, and the team structures that make data engineering reliable at enterprise scale.
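
Module 05 above covers producer-consumer contracts and breaking changes. As a rough sketch of what a contract test might check, the function below compares a proposed schema against the agreed one. The compatibility rules are an assumption made for illustration (removed columns and changed types break consumers, added columns do not). Real schema registries define their own compatibility modes, and the column names here are hypothetical.

```python
def breaking_changes(old_schema, new_schema):
    """List changes that would break a consumer reading with old_schema.

    Schemas are dicts of column name -> type name. Removed columns and
    changed types are reported; added columns are treated as compatible.
    """
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != old_type:
            problems.append(
                f"type change on {column}: {old_type} -> {new_schema[column]}"
            )
    return problems

# Hypothetical contract agreed between producer and consumers.
contract = {"meter_id": "string", "reading_kwh": "double", "read_at": "timestamp"}
# Hypothetical schema the producer wants to ship next.
proposed = {"meter_id": "string", "reading_kwh": "string", "region": "string"}

for problem in breaking_changes(contract, proposed):
    print(problem)
```

Run in CI on the producer side, a check of this kind turns a silent downstream failure into a failed build that has to be negotiated first.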

§ 02 The practice

One task, mid-tier. Thirty more behind it.

Practice Set 5 — Delta Lake for Data Engineering
MID · DELTA LAKE · TASK 03 / 06

Diagnose Missing Data Using Time Travel

CONTEXT

NordGrid's billing team reports yesterday's Gold summary shows 15% lower total consumption for NORTH_COAST. Engineers suspect a MERGE corrupted the Silver Delta table.

TASK

Find the version number to compare against, then read the table as it was before yesterday's MERGE.

QUESTION

Which command surfaces the version numbers and operation history you need to pick a comparison point?
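
Once the history rows are in hand, picking the comparison point is a small piece of logic. The sketch below assumes each history row exposes `version`, `timestamp`, and `operation` fields (Delta Lake's table history includes columns with these names) and that timestamps are ISO-8601 strings in one consistent format. The `version_before` helper and the sample rows are hypothetical.

```python
def version_before(history, operation, since):
    """Return the newest version committed before the first matching operation.

    history: rows with "version", "timestamp" and "operation" keys.
    operation: operation name to look for, for example "MERGE".
    since: only operations at or after this timestamp are considered.
    Returns None when no such operation, or no earlier version, exists.
    """
    ordered = sorted(history, key=lambda row: row["version"])
    for index, row in enumerate(ordered):
        if row["operation"] == operation and row["timestamp"] >= since:
            return ordered[index - 1]["version"] if index > 0 else None
    return None

# Hypothetical history rows for the Silver table.
history = [
    {"version": 41, "timestamp": "2024-05-01T02:00", "operation": "WRITE"},
    {"version": 42, "timestamp": "2024-05-02T02:00", "operation": "MERGE"},
    {"version": 43, "timestamp": "2024-05-03T02:00", "operation": "MERGE"},
]
print(version_before(history, "MERGE", since="2024-05-03T00:00"))
```

In Delta Lake, the chosen version can then be passed to a time-travel read, for example through the `versionAsOf` reader option, and the result compared against the current table.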

§ 03 The grid game

Earn kWh from study. Deploy it on the grid.

Sessions and drills earn kWh. Spend them deploying these ten components across the country until the grid is back online.

10 COMPONENTS · 6 CATEGORIES

BACKBONE
PS3 · DataFrames & Schemas
Where high-voltage transmission steps down to a district feeder. In code, where raw landing rows acquire a schema and become a DataFrame you can actually query.
Primary Substation · 250 kWh
BACKBONE
PS4 · Selecting & Filtering
Voltage matched to local load. In code, columns matched to questions — the operator you reach for to shrink, shape, and route data toward whoever needs it.
Power Transformer · 375 kWh
PROTECTION
PM4 · Validation & Bad Data
Trips faulted lines before damage cascades. In code, the schema enforcement and null/dup guards that stop one bad CSV from corrupting the whole pipeline.
Protective Relay · 800 kWh
STORAGE
PM5 · Delta Lake
Holds energy until peak demand. Delta holds state — ACID writes, time travel, MERGE — so yesterday's pipeline run is still queryable tomorrow.
Battery Storage · 1200 kWh
BALANCING
PM2 · Window Functions
Corrects reactive power so the grid stays in phase. Window functions correct row-by-row context so totals, ranks, and rolling sums stay in step with the data.
Capacitor Bank · 450 kWh
PROTECTION
PS9 · Error Handling
Opens the moment a fault is detected. In code, the try/except, checkpoint, and dead-letter routing that isolate one bad batch from the next.
Circuit Breaker · 320 kWh
COMMAND
PX3 · Spark UI & Tuning
The mimic board operators read in real time. The Spark UI is yours — stage timelines, executor heatmaps, SQL DAGs — once you know how to read it.
Control Center · 2400 kWh
BALANCING
PX1 · Catalyst & AQE
Converts DC to AC on the fly, optimising for whatever the grid needs. Adaptive Query Execution rewrites your plan at runtime for the data it actually sees.
Smart Inverter · 680 kWh
GENERATION
PM7 · Streaming
Generates power as long as the source flows. Structured Streaming treats the source the same way — every micro-batch a new arrival on the same pipeline.
Solar Array · 950 kWh
GENERATION
PX5 · Partitioning & Skew
Many small generators, one combined output. Many small partitions, one combined job — but only if the wind is even. Skew is what wrecks both.
Wind Cluster · 1100 kWh
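
The skew described under PX5 can be sketched without a cluster. The toy model below counts rows per partition for a workload where one region dominates, then applies key salting, a common mitigation. Everything here is an illustrative assumption: the byte-sum partitioner stands in for a real hash partitioner so the output is repeatable, the salt rotates rather than being random, and the region names and row counts are made up.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count rows per partition under a simple deterministic key partitioner."""
    sizes = Counter()
    for key in keys:
        # Byte sum instead of hash() so results do not vary between runs.
        sizes[sum(key.encode()) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]

def salted(keys, num_salts):
    """Append a rotating salt so one hot key becomes num_salts distinct keys."""
    return [f"{key}#{i % num_salts}" for i, key in enumerate(keys)]

# Hypothetical workload: one region dominates the data.
keys = ["NORTH_COAST"] * 900 + ["SOUTH_BAY"] * 50 + ["EAST_RIDGE"] * 50

before = partition_sizes(keys, num_partitions=4)
after = partition_sizes(salted(keys, num_salts=4), num_partitions=4)
print("largest partition before salting:", max(before))
print("largest partition after salting:", max(after))
```

Salting is not free: an aggregation needs a second pass to combine the salted partial results, and a join needs the other side replicated once per salt value.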