Big Data Course Labs

Welcome to the Big Data course lab repository. These hands-on labs progressively build your skills in large-scale data processing, from file I/O fundamentals through streaming algorithms and kernel approximation methods to distributed computing with Apache Spark.

About This Course

As datasets outgrow the memory of a single machine, tools that load everything at once are no longer sufficient. This course teaches you how to work with large-scale data efficiently by understanding:

Storage optimization: Why file formats matter and how columnar storage (Parquet) outperforms row-based formats (CSV)
Performance measurement: How to benchmark I/O operations and identify bottlenecks
Resource management: Understanding memory constraints and handling out-of-memory scenarios
Algorithmic thinking: How to choose data structures and algorithms that scale
Modern tooling: Using professional Python tools like uv for fast dependency management

Learning Outcomes

By the end of this course, you will be able to:

Analyze and optimize data storage formats for different use cases
Measure and compare the performance of different data processing approaches
Design systems that handle datasets larger than available RAM
Implement probabilistic data structures with provable error guarantees
Apply kernel approximation methods to make kernel ML feasible at scale
Understand distributed computing fundamentals: lazy evaluation, shuffles, and partitioning

Course Structure

Lab 01: Environment Setup and I/O Benchmarking

In this first session (see lab guide) we focus on:

Modern development environment: Using uv to manage Python and dependencies
I/O Benchmark: Compare CSV vs Parquet to experience the performance difference firsthand
Memory management: Understand what happens when your data doesn't fit in RAM
Philosophy: "What isn't measured can't be improved." We'll measure read times, disk space, and memory usage
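
One way to put this measuring habit into practice is a small helper that records wall-clock time and peak memory for any callable. The sketch below uses only the standard library. The names measure and build_squares are hypothetical, and the list-building workload merely stands in for reading a file. Note that tracemalloc only sees allocations made through Python's allocator, so memory used internally by native libraries may not be counted, and tracing itself slows the code down.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current_bytes, peak_bytes).
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def build_squares(n):
    # Hypothetical workload standing in for "load a dataset into memory".
    return [i * i for i in range(n)]


result, elapsed, peak = measure(build_squares, 100_000)
print(f"{elapsed:.4f} s, peak {peak / 1e6:.1f} MB")
```

The same helper could wrap a CSV read and a Parquet read so both are compared under identical conditions.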

Lab 02: Complexity and the Data Flow

Understanding that \(N=1{,}000\) is not the same as \(N=1{,}000{,}000\) (see lab guide):

The Scale Factor: Why "fast enough" code fails at scale
Memory Hierarchy: Proving via benchmarks that RAM is faster than disk
Big O Notation: Practical application in code profiling and optimization
Data Flow: Chunking, streaming, and full loading strategies
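
The chunking strategy can be illustrated without any third-party library. The sketch below is a minimal example, assuming a source with one integer per line. The helper iter_chunks and the in-memory io.StringIO "file" are hypothetical stand-ins for a real file on disk. Only one chunk of rows is held in memory at a time, whereas full loading would hold all of them.

```python
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory "file" with one integer per line.
source = io.StringIO("\n".join(str(i) for i in range(10_000)))

total = 0
max_rows_in_memory = 0
for chunk in iter_chunks(source, chunk_size=1_000):
    # Aggregate each chunk, then let it be garbage collected.
    total += sum(int(line) for line in chunk)
    max_rows_in_memory = max(max_rows_in_memory, len(chunk))

print(total, max_rows_in_memory)
```

The running total is identical to what full loading would produce, while the peak number of rows in memory is bounded by chunk_size.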

Lab 03: Data Types and Efficient Formats

Understanding that data types matter for performance (see lab guide):

Data Type Optimization: Reduce memory use, often by 5–10x, with proper dtype selection
Format Comparison: CSV vs Parquet vs Feather trade-offs
Parquet Deep Dive: Row groups, compression, predicate pushdown
Partitioning Strategies: Organize data for fast analytical queries
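
The effect of dtype selection can be seen even with the standard library's array module, used here as a stand-in for NumPy or pandas dtypes (an assumption made to keep the example dependency-free). The column of small integers is hypothetical. Storing it as 1-byte integers instead of 8-byte integers shrinks the payload by a factor of eight.

```python
from array import array

# Hypothetical column: one million small integers (for example, ages 0..99).
values = [i % 100 for i in range(1_000_000)]

as_int64 = array("q", values)  # 8-byte signed integers
as_int8 = array("b", values)   # 1-byte signed integers, range -128..127

# Payload size is item size times number of items.
bytes_int64 = as_int64.itemsize * len(as_int64)
bytes_int8 = as_int8.itemsize * len(as_int8)

print(bytes_int64, bytes_int8, bytes_int64 / bytes_int8)
```

The narrower type is only safe because every value fits in its range; a value outside -128..127 would raise an OverflowError here.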

Lab 04: Vectorization and Broadcasting

Combining efficient storage with fast computation (see lab guide):

Format Comparison: CSV vs Parquet (Snappy, Zstd) vs Feather — size, speed, and features
Column Pruning & Predicate Pushdown: Read only what you need from Parquet
Vectorization: Replace Python loops with NumPy/pandas operations (often a 100–200x speedup)
Broadcasting: Apply operations across arrays without explicit loops
Pipeline Optimization: Combine format choice and vectorization for maximum performance

Lab 05: Out-of-Core, Streaming & Parallel Processing

Processing datasets larger than RAM (see lab guide):

PyArrow Direct: When to use PyArrow directly instead of pandas for I/O, including projection pushdown
Out-of-Core Processing: Handle datasets that don't fit in RAM using chunking
Online Statistics: Compute mean/std in a single pass with Welford's algorithm
Parallelization: Threading for I/O-bound, multiprocessing for CPU-bound tasks
Pipeline Design: Combine chunking and parallelization for scalable processing
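
Welford's algorithm, mentioned above, can be sketched in a few lines. The class name RunningStats and the sample data are hypothetical. Each value is seen exactly once and only three numbers are kept in memory, which is what makes the method suitable for chunked or streamed input.

```python
import math
import statistics


class RunningStats:
    """Single-pass mean and sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old delta and the updated mean, as Welford's update requires.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (n - 1 in the denominator).
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return math.sqrt(self.variance)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningStats()
for x in data:
    stats.update(x)

# Compare against the two-pass results from the standard library.
print(stats.mean, statistics.fmean(data))
print(stats.std, statistics.stdev(data))
```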

Lab 06: Streaming Algorithms

Computing statistics over data streams without storing the full dataset (see lab guide):

Reservoir Sampling: Uniform random samples from a stream of unknown length
Count-Min Sketch: Approximate frequency counting with bounded error
HyperLogLog: Cardinality estimation using \(O(\log \log n)\) bits per register
Heavy Hitters: Find frequent items in a stream with limited memory
Error Analysis: Understanding space–accuracy trade-offs in streaming algorithms
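
As an illustration of the space–accuracy trade-off, here is a minimal Count-Min Sketch. It is a sketch under stated assumptions, not the lab's reference implementation: the class and parameter names are hypothetical, and SHA-256 with a row prefix stands in for the pairwise-independent hash family that the formal analysis assumes. With width \(\lceil e/\varepsilon \rceil\) and depth \(\lceil \ln(1/\delta) \rceil\), the estimate never falls below the true count and exceeds it by at most \(\varepsilon N\) with probability at least \(1-\delta\), where \(N\) is the total number of items added.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.01):
        self.width = math.ceil(math.e / epsilon)     # counters per row
        self.depth = math.ceil(math.log(1 / delta))  # number of rows
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        # Stand-in hash: SHA-256 of "row:item", reduced to a column index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


# Hypothetical skewed stream: 50 regular items plus one frequent item.
stream = [f"item{i % 50}" for i in range(5_000)] + ["hot"] * 1_000
sketch = CountMinSketch(epsilon=0.01, delta=0.01)
exact = Counter()
for item in stream:
    sketch.add(item)
    exact[item] += 1

print(exact["hot"], sketch.estimate("hot"))
```

The table size depends only on \(\varepsilon\) and \(\delta\), not on the number of distinct items in the stream.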

Lab 07: Probabilistic Data Structures & Polars

Trading exactness for scale: approximate answers when exact ones are too expensive (see lab guide):

Bloom Filters: Membership testing with zero false negatives and bounded false positives
Count-Min Sketch (extended): Frequency estimation and join size approximation
HyperLogLog (extended): Cardinality estimation under real-world skew
Polars Introduction: High-performance DataFrames with a lazy evaluation engine
Benchmarking: Compare probabilistic vs exact approaches across dataset sizes
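
The Bloom filter guarantee (no false negatives, a tunable false-positive rate) can be demonstrated with a short sketch. The class below is illustrative only: its names are hypothetical, and deriving the k bit positions from two halves of a SHA-256 digest (double hashing) is one common choice among several. The bit-array size and number of hash functions follow the standard formulas \(m = -n \ln p / (\ln 2)^2\) and \(k = (m/n) \ln 2\).

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing for `capacity` items at the target false-positive rate.
        self.size = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: position_i = (h1 + i * h2) mod size.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bloom = BloomFilter(capacity=1_000, error_rate=0.01)
members = [f"user{i}" for i in range(1_000)]
for m in members:
    bloom.add(m)

# Items that were never added may still test positive (false positives).
non_members = [f"other{i}" for i in range(10_000)]
false_positives = sum(1 for x in non_members if x in bloom)
fp_rate = false_positives / len(non_members)
print(bloom.size, bloom.num_hashes, fp_rate)
```

Every added item is always reported as present. The observed false-positive rate should land near the configured target as long as the filter is not filled beyond its capacity.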

Lab 08: Kernel Approximation Methods

Making kernel machine learning feasible for large datasets (see lab guide):

Exact RBF Kernel: Why the \(n \times n\) kernel matrix, with its \(O(n^2)\) memory cost, does not scale
Random Fourier Features (RFF): Randomized approximation via Bochner's theorem
Orthogonal Random Features (ORF): Reduced-variance alternative to RFF
Nyström Approximation: Landmark-based low-rank kernel factorization
Kernel Ridge Regression: End-to-end regression with approximate kernels
Benchmarking: Time, memory, and approximation error across all methods
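
To make the Random Fourier Features idea concrete, the sketch below approximates a single RBF kernel value \(k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)\) in pure Python. It is a minimal illustration, assuming the cosine/sine feature variant. All function names, the value of gamma, and the two sample points are hypothetical. Frequencies are drawn from \(\mathcal{N}(0, 2\gamma I)\), which is the spectral density Bochner's theorem associates with this kernel, so the inner product of the feature vectors is an unbiased estimate of the kernel value.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sample_frequencies(dim, num_features, gamma, rng):
    # Each frequency vector w has i.i.d. N(0, 2 * gamma) components.
    sigma = math.sqrt(2.0 * gamma)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(num_features)]


def rff_map(x, frequencies):
    # z(x) = [cos(w.x), sin(w.x) for each w] / sqrt(D)
    scale = 1.0 / math.sqrt(len(frequencies))
    features = []
    for w in frequencies:
        proj = sum(wi * xi for wi, xi in zip(w, x))
        features.append(scale * math.cos(proj))
        features.append(scale * math.sin(proj))
    return features


rng = random.Random(0)  # fixed seed for reproducibility
gamma = 0.5
freqs = sample_frequencies(dim=3, num_features=2_000, gamma=gamma, rng=rng)

x = [0.1, -0.4, 0.7]
y = [0.3, 0.2, -0.1]

exact = rbf_kernel(x, y, gamma)
approx = sum(a * b for a, b in zip(rff_map(x, freqs), rff_map(y, freqs)))
print(exact, approx)
```

Because each point is mapped to a fixed-length vector, a linear model on these features costs time linear in the number of samples, instead of requiring the full \(n \times n\) kernel matrix.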

Lab 09: Final Project Setup — Git, CI/CD & Documentation

Building a professional meteorological analysis project from scratch (see lab guide):

Repository Skeleton: Directory structure, .gitignore, and initial README
pyproject.toml & Python Package: Single source of truth for dependencies, tooling config, and the src/weather/ package
Tests with pytest: Unit tests with coverage reports targeting ≥70%
Code Quality with ruff: Lint and auto-format all Python files before every commit
MkDocs Documentation: Documentation site with the Material theme and light/dark palette
GitHub Actions CI/CD: Three workflows — CI (lint + test), docs deploy to GitHub Pages, and auto-versioned releases
README with Badges: Live CI, coverage, version, and Python badges

Lab 10: PySpark — First Contact

Introduction to distributed computing with Apache Spark (see lab guide):

Spark Architecture: Driver, executors, and the Spark UI in local[*] mode
RDD vs DataFrame APIs: Low-level MapReduce vs high-level SQL-like pipelines
Lazy Evaluation & DAG: Transformations build a plan; only actions trigger execution
Shuffles & Stages: Why join, groupBy, and orderBy force data movement
Partitioning & Hot Spots: How skewed data creates bottleneck partitions
Benchmarking: Measuring the real cost of repartitioning vs shuffling
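
The hot-spot problem can be illustrated without a Spark cluster. The sketch below is plain Python, not the PySpark API: it assigns rows to partitions by hashing the key, with zlib.crc32 standing in for Spark's own hash function, and the key distribution is hypothetical. Because every row with the same key must land in the same partition, one very frequent key makes one partition far larger than the rest.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for hash partitioning: same key, same partition, always.
    return zlib.crc32(key.encode()) % num_partitions


# Hypothetical skewed dataset: one key accounts for 90% of the rows.
keys = ["user_0"] * 9_000 + [f"user_{i}" for i in range(1, 1_001)]

rows_per_partition = Counter(partition_for(k) for k in keys)
hot = partition_for("user_0")

for p in range(NUM_PARTITIONS):
    marker = "  <- hot partition" if p == hot else ""
    print(p, rows_per_partition[p], marker)
```

In a groupBy or join on this key, the task that processes the hot partition would finish long after the others, which is the bottleneck this lab measures.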

Getting Started

New to the course? Start with Lab 01, which includes complete setup instructions for Python, VS Code, Git, and uv.

Tools We Use

Python 3.11+: Modern Python with type hints
uv: Ultra-fast Python package manager
pandas / numpy: Data manipulation and numerical computing
scipy: Scientific computing — linear algebra, statistics
polars: High-performance DataFrame library with lazy evaluation
PySpark: Distributed data processing with Apache Spark
pyarrow / Parquet: Efficient columnar storage
matplotlib: Visualization
Jupyter: Interactive notebooks for exploration
VS Code: Professional code editor

Use the menu on the left to access instructions and reference guides for each lab.