stablegrid.io

Handle big data with ease — learn PySpark.

§ 01 The curriculum

One lesson, then thirty more.

Module 2 of 10 · The SparkSession — Your Entry Point

What the SparkSession Does

from pyspark.sql import SparkSession

# Create the single entry point to Spark
spark = SparkSession.builder \
    .appName('nordgrid-daily-pipeline') \
    .getOrCreate()

# Everything flows through the spark object
df = spark.read.csv('bronze/meters/')       # read data
df.createOrReplaceTempView('meters')        # register for SQL
result = spark.sql('SELECT * FROM meters')  # run SQL
shuffle_partitions = spark.conf.get('spark.sql.shuffle.partitions')  # read config

print(type(spark))  # <class 'pyspark.sql.session.SparkSession'>

Thirty modules. Junior to Senior.

300 LESSONS · WRITTEN IN FULL
01 · Junior-Level Track

Core Foundation

  1. Why Spark Exists

    Understand why distributed computing is necessary, how Spark solves the single-machine bottleneck, and how the NordGrid Energy scenario maps to a real-world data engineering context.

  2. The SparkSession — Your Entry Point

    Understand the SparkSession as the single entry point to all Spark functionality, learn how to create and configure it, and gain familiarity with the Spark UI as a diagnostic tool.

  3. DataFrames — The Core Abstraction

    Understand DataFrames as Spark’s primary distributed data structure, master creation and schema patterns, and learn the immutability and lazy evaluation models that make distributed processing safe and fast.

  4. Selecting and Filtering Data

    Master the core data shaping operations: choosing which columns to keep, which rows to keep, and how to chain these operations into clean transformation pipelines.

  5. Adding and Transforming Columns

    Learn to create new columns, modify existing values, and apply the full range of column-level transformations that turn raw data into analytical assets.

  6. Aggregations and Grouping

    Understand how Spark computes summaries across distributed data, master groupBy and agg patterns, and learn how aggregation triggers the shuffle that moves data between executors.

  7. Joins — Combining DataFrames

    Understand every join type Spark supports, learn to choose the right join for each data engineering scenario, and handle the column ambiguity problem that joins create.

  8. Reading and Writing Data

    Master the DataFrameReader and DataFrameWriter APIs, understand file format characteristics, and build complete read-transform-write pipelines following the NordGrid Bronze-Silver-Gold pattern.

  9. Built-in Functions and Expressions

    Gain working knowledge of Spark’s built-in function library, learn to compose functions into complex expressions, and understand when to use a built-in function versus a user-defined function.

  10. Spark SQL — The SQL Interface

    Learn to use SQL alongside the DataFrame API, understand when SQL is the clearer choice, and build the hybrid SQL-plus-DataFrame pipelines that most production teams use.

02 · Mid-Level Track

Advanced Systems

  1. Execution Plans and the Catalyst Optimizer

    Read and interpret physical execution plans, understand Catalyst optimization rules, and use plan analysis to verify that Spark is executing your pipeline the way you intend.

  2. Partitioning Strategy

    Master partition count tuning, understand the trade-offs between repartition and coalesce, and design partitioning schemes that balance read performance, write performance, and storage efficiency.

  3. Join Performance and Shuffle Optimization

    Master the performance characteristics of distributed joins — broadcast strategies, shuffle mechanics, skew detection, and the optimizer behaviors that determine whether a join takes seconds or hours.

  4. Schema Evolution and Delta Lake

    Understand schema evolution as a production reality, master Delta Lake’s ACID transactions and MERGE operations, and build pipelines that handle schema changes without downtime.

  5. Pipeline Design Patterns

    Design idempotent, incremental, and fault-tolerant pipelines that handle real-world production concerns: late-arriving data, backfills, slowly changing dimensions, and multi-stage error recovery.

  6. Testing PySpark Pipelines

    Build a comprehensive testing strategy for PySpark code: unit tests for transformations, integration tests for pipelines, data validation tests for output quality, and the test infrastructure that makes testing practical.

  7. Deployment and CI/CD

    Automate the path from code change to production deployment: spark-submit automation, Airflow integration patterns, environment management, and CI/CD pipelines that validate PySpark code before it reaches production.

  8. Memory Management and Caching

    Understand Spark’s memory architecture, diagnose out-of-memory errors, design effective caching strategies, and configure memory settings that balance performance and stability.

  9. Performance Monitoring and Tuning

    Build a systematic approach to performance monitoring: instrument pipelines with metrics, interpret Spark UI data, diagnose common performance problems, and apply targeted tuning techniques.

  10. Structured Streaming Foundations

    Understand Spark’s unified batch-streaming model, build micro-batch streaming pipelines for near-real-time data processing, and handle the production concerns of triggers, watermarks, and output modes.

03 · Senior-Level Track

Platform Architecture

  1. Platform Architecture at Scale

    Design a multi-team data platform on Spark — catalog structure, namespace conventions, table ownership, layer contracts, and the organizational patterns that prevent a 50-pipeline platform from collapsing under its own weight.

  2. Advanced Catalyst and AQE Internals

    Understand the Catalyst optimizer and Adaptive Query Execution at the implementation level — rule ordering, cost models, statistics propagation, and the runtime adaptation mechanisms that determine how Spark actually executes your query.

  3. Partition Strategy at Scale

    Design, implement, and maintain partition strategies for multi-terabyte datasets — covering column selection, granularity trade-offs, dynamic partition pruning, small files remediation, and the operational burden of repartitioning a live production pipeline.

  4. Advanced Join Architecture

    Design join strategies for multi-terabyte pipelines with complex topologies — bucketed joins, range joins, join elimination, semi-join pushdown, and the architectural decisions that determine whether a multi-hop join pipeline takes 10 minutes or 10 hours.

  5. Schema Registries and Data Contracts

    Implement cross-team schema governance at scale — schema registries, producer-consumer contracts, breaking change negotiation, contract testing across team boundaries, and the organizational processes that prevent schema changes from breaking downstream pipelines.

  6. Spark Internals — The Execution Engine

    Understand Spark's execution engine at the implementation level — the DAGScheduler, the TaskScheduler, speculative execution, shuffle mechanics, and the resource negotiation that determines how your application shares the cluster with others.

  7. Spark Internals — Storage and I/O

    Understand Spark's storage layer at the implementation level — Parquet internals, columnar encoding, predicate evaluation, the DataSource V2 API, and the emerging table format landscape that determines how data is physically stored and accessed.

  8. Multi-Tenant Operations and Security

    Design and operate a secure, multi-tenant Spark platform — Unity Catalog deep dive, row-level and column-level security, audit logging, data masking, and the compliance frameworks that govern data access in regulated industries.

  9. Advanced Streaming Architecture

    Design production streaming architectures beyond the mid-level foundations — stream-stream joins, Kafka integration, state store management, exactly-once delivery, and the architectural patterns that make streaming reliable at enterprise scale.

  10. Enterprise Governance and Capacity Planning

    Establish the organizational and operational frameworks that sustain a multi-team Spark platform over years — data lineage, capacity planning, platform SLA governance, and the team structures that make data engineering reliable at enterprise scale.

§ 02 The practice

One task, mid-tier. Thirty more behind it.

Practice Set 5 — Delta Lake for Data Engineering
MID · DELTA LAKE · TASK 03 / 06

Diagnose Missing Data Using Time Travel

CONTEXT

NordGrid's billing team reports that yesterday's Gold summary shows 15% lower total consumption for NORTH_COAST. Engineers suspect that a MERGE corrupted the Silver Delta table.

TASK

Find the version number to compare against, then read the table as it was before yesterday's MERGE.

QUESTION

Which command surfaces the version numbers and operation history you need to pick a comparison point?
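One way the task above might be approached, as a hedged sketch rather than the course's own solution. It assumes an existing `spark` session with Delta Lake configured. `DESCRIBE HISTORY` and the `versionAsOf` read option are Delta Lake features; the table name `silver.meters`, the path `silver/meters/`, and all three helper names are hypothetical. The helper also assumes the MERGE of interest is the most recent one in the history.

```python
def history_sql(table_name):
    """Build the Delta Lake statement that lists versions, timestamps, and operations."""
    return f"DESCRIBE HISTORY {table_name}"


def version_before_latest_merge(history_rows):
    """Pick the version just before the most recent MERGE.

    `history_rows` is any iterable of mappings with "version" and "operation"
    keys, such as the collected rows of DESCRIBE HISTORY.
    Returns None when there is no MERGE or no earlier version.
    """
    merge_versions = [
        row["version"] for row in history_rows if row["operation"] == "MERGE"
    ]
    if not merge_versions:
        return None
    latest = max(merge_versions)
    return latest - 1 if latest > 0 else None


def read_as_of_version(spark, path, version):
    """Read a Delta table as it was at the given version (time travel)."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )


# Hypothetical usage against a live session (not executed here):
# history = spark.sql(history_sql("silver.meters")).collect()
# version = version_before_latest_merge(history)
# before_merge = read_as_of_version(spark, "silver/meters/", version)
```

Comparing `before_merge` against the current table for NORTH_COAST would then show whether the MERGE dropped or altered rows.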

§ 03 The grid game

Earn kWh from study. Deploy it on the grid.

Sessions and drills earn kWh. Spend them deploying these ten components across the country until the grid is back online.

10 COMPONENTS · 6 CATEGORIES
  • Primary Substation · BACKBONE
    PS3 · DataFrames & Schemas

    Where high-voltage transmission steps down to a district feeder. In code, where raw landing rows acquire a schema and become a DataFrame you can actually query.

    Primary Substation · 250 kWh
  • Power Transformer · BACKBONE
    PS4 · Selecting & Filtering

    Voltage matched to local load. In code, columns matched to questions — the operator you reach for to shrink, shape, and route data toward whoever needs it.

    Power Transformer · 375 kWh
  • Protective Relay · PROTECTION
    PM4 · Validation & Bad Data

    Trips faulted lines before damage cascades. In code, the schema enforcement and null/dup guards that stop one bad CSV from corrupting the whole pipeline.

    Protective Relay · 800 kWh
  • Battery Storage · STORAGE
    PM5 · Delta Lake

    Holds energy until peak demand. Delta holds state — ACID writes, time travel, MERGE — so yesterday's pipeline run is still queryable tomorrow.

    Battery Storage · 1200 kWh
  • Capacitor Bank · BALANCING
    PM2 · Window Functions

    Corrects reactive power so the grid stays in phase. Window functions correct row-by-row context so totals, ranks, and rolling sums stay in step with the data.

    Capacitor Bank · 450 kWh
  • Circuit Breaker · PROTECTION
    PS9 · Error Handling

    Opens the moment a fault is detected. In code, the try/except, checkpoint, and dead-letter routing that isolate one bad batch from the next.

    Circuit Breaker · 320 kWh
  • Control Center · COMMAND
    PX3 · Spark UI & Tuning

    The mimic board operators read in real time. The Spark UI is yours — stage timelines, executor heatmaps, SQL DAGs — once you know how to read it.

    Control Center · 2400 kWh
  • Smart Inverter · BALANCING
    PX1 · Catalyst & AQE

    Converts DC to AC on the fly, optimising for whatever the grid needs. Adaptive Query Execution rewrites your plan at runtime for the data it actually sees.

    Smart Inverter · 680 kWh
  • Solar Array · GENERATION
    PM7 · Streaming

    Generates power as long as the source flows. Structured Streaming treats the source the same way — every micro-batch a new arrival on the same pipeline.

    Solar Array · 950 kWh
  • Wind Cluster · GENERATION
    PX5 · Partitioning & Skew

    Many small generators, one combined output. Many small partitions, one combined job — but only if the wind is even. Skew is what wrecks both.

    Wind Cluster · 1100 kWh
EVERY COMPONENT TIES TO A CHAPTER YOU READ · Open the grid game
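The Wind Cluster note about uneven partitions can be made concrete with a small sketch. It assumes you already have per-partition row counts as plain integers; in PySpark, one way to obtain them is `df.groupBy(spark_partition_id()).count()`. The function names and the `2.0` threshold are illustrative assumptions, not values taken from the course.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the median size."""
    sizes = list(partition_sizes)
    if not sizes:
        raise ValueError("need at least one partition size")
    mid = median(sizes)
    if mid == 0:
        # All-empty partitions are treated as even; otherwise the ratio is unbounded.
        return float("inf") if max(sizes) > 0 else 1.0
    return max(sizes) / mid


def is_skewed(partition_sizes, threshold=2.0):
    """Flag a layout whose largest partition exceeds `threshold` times the median."""
    return skew_ratio(partition_sizes) > threshold


even_wind = [1000, 980, 1020, 1005]    # partitions of similar size
uneven_wind = [1000, 980, 1020, 9000]  # one partition carries most of the rows

print(is_skewed(even_wind))    # False
print(is_skewed(uneven_wind))  # True
```

The median is used instead of the mean so that a single oversized partition does not inflate the baseline it is compared against.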
§ 04 Self-selection

Honest about who this is for.

For
  • Engineers shipping pipelines who want depth, not another intro.
  • Analysts moving from pandas/SQL who need PySpark idioms that scale.
  • Senior data folks brushing up before interviews or platform reviews.
  • Self-directed readers who prefer typography over talking heads.
Not for
  • Absolute programming beginners — assumes Python comfort and basic SQL.
  • People who want video tutorials with autoplay and sticky chapter timers.
  • Anyone shopping for a credential. There is no certificate.
  • Tool-of-the-month tourists. This is PySpark, deeply, and nothing else.
§ 05 Subscription

Free during beta. €14.99 once if you want to back it.

Start free · No credit card during beta