Lab 03: Data Types

Welcome to the third Big Data laboratory session! In this lab, you'll learn how to dramatically reduce memory usage and improve performance through smart data type choices.

🎯 What You Will Learn

  • Memory Measurement: How to accurately measure DataFrame memory usage
  • Data Range Analysis: How to analyze value ranges to choose optimal types
  • Integer Sizing: When to use uint8, uint16, uint32, etc.
  • Category dtype: How to optimize repeated strings for huge memory savings
  • Performance Impact: How smaller types lead to faster operations

✅ Pre-flight Checklist

Before starting, ensure you have:

  1. Completed Lab 02: You understand profiling and complexity.
  2. Updated your repo: Run git pull to get the latest changes.
  3. Checked out main: Run git checkout main.
  4. Created a local branch: Run git checkout -b <your_branch_name>.
  5. Installed dependencies: Run uv sync.

📝 Lab Steps

Follow along in the notebook notebooks/lab03_data_types.ipynb.

A. Generate the Dataset

We'll generate a synthetic e-commerce dataset with 5 million rows containing:

  • order_id: Unique order identifier (0 to 4,999,999)
  • product_id: Product ID (1-50,000)
  • category: Product category (15 unique values)
  • price: Product price (0.01 - 999.99)
  • quantity: Quantity ordered (1-100)
  • country: Customer country (30 unique values)
  • timestamp: Order timestamp

TODO 1: Implement generate_ecommerce_data() to create the dataset.

Goal: Create data/raw/ecommerce_5m.csv (~500MB).
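A minimal sketch of what generate_ecommerce_data() could look like. The column names and value ranges come from the dataset description above; everything else (the seed, the category/country label format, the one-year timestamp window) is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def generate_ecommerce_data(n_rows: int = 5_000_000, seed: int = 42) -> pd.DataFrame:
    """Synthetic e-commerce orders; columns and ranges follow the lab spec."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "order_id": np.arange(n_rows),                              # 0 .. n_rows-1
        "product_id": rng.integers(1, 50_001, n_rows),              # 1 .. 50,000
        "category": rng.choice([f"cat_{i}" for i in range(15)], n_rows),
        "price": rng.uniform(0.01, 999.99, n_rows).round(2),
        "quantity": rng.integers(1, 101, n_rows),                   # 1 .. 100
        "country": rng.choice([f"country_{i}" for i in range(30)], n_rows),
        # assumed: orders spread over one year of seconds
        "timestamp": pd.Timestamp("2024-01-01")
                     + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, n_rows), unit="s"),
    })

# Then write the CSV the later steps will load:
# generate_ecommerce_data().to_csv("data/raw/ecommerce_5m.csv", index=False)
```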


B. Baseline Measurement

Measure how much memory pandas wastes with default dtypes.

TODO 2: Implement measure_memory() to measure DataFrame memory usage.

What you'll observe:

  • int64 uses 8 bytes for ALL integers (even small ones!)
  • float64 uses 8 bytes for ALL floats
  • object uses ~50+ bytes per string value

Goal: Understand the memory cost of default types.
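The key is pandas' memory_usage(deep=True): without deep=True an object column reports only its 8-byte pointers, hiding the true cost of the Python strings behind them. A minimal sketch of measure_memory() (the MB unit is an assumption):

```python
import pandas as pd

def measure_memory(df: pd.DataFrame) -> pd.Series:
    """Per-column memory in MB. deep=True inspects the actual string
    objects in object columns instead of counting only their pointers."""
    return df.memory_usage(deep=True) / 1024**2
```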


C. Type Analysis & Optimization

This is the core of the lab: analyze your data and choose optimal types.

TODO 3: Implement analyze_column_ranges() to identify min/max/nunique for each column.
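One possible shape for analyze_column_ranges(), assuming a one-row-per-column summary frame (the exact return format is up to you):

```python
import pandas as pd

def analyze_column_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """min/max (numeric columns only) and distinct-value count per column."""
    rows = []
    for col in df.columns:
        s = df[col]
        is_num = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "min": s.min() if is_num else None,
            "max": s.max() if is_num else None,
            "nunique": s.nunique(),
        })
    return pd.DataFrame(rows).set_index("column")
```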

TODO 4: Implement get_optimal_dtypes() to return the optimal dtype mapping:

Column       Range             Optimal Type
-----------  ----------------  --------------------
order_id     0 to 5M           uint32 (max 4.3B)
product_id   1 to 50,000       uint16 (max 65,535)
category     15 unique values  category
price        0.01 to 999.99    float32
quantity     1 to 100          uint8 (max 255)
country      30 unique values  category
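The table above can be expressed directly as a dtype mapping; a sketch of get_optimal_dtypes() (timestamp is omitted because it is parsed separately as a datetime, not cast via dtype):

```python
def get_optimal_dtypes() -> dict:
    """The dtype mapping from the table above."""
    return {
        "order_id": "uint32",    # fits 0..4,999,999 (uint32 max ~4.3B)
        "product_id": "uint16",  # fits 1..50,000 (uint16 max 65,535)
        "category": "category",  # 15 unique values
        "price": "float32",      # halves the 8-byte float64 default
        "quantity": "uint8",     # fits 1..100 (uint8 max 255)
        "country": "category",   # 30 unique values
    }
```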

TODO 5: Implement load_with_optimized_dtypes() to load CSV with optimal types.

Goal: Achieve >5x memory reduction.
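Passing the mapping to read_csv's dtype parameter applies the types at parse time, so the 8-byte defaults are never materialized. A sketch of load_with_optimized_dtypes() (the parse_dates handling of timestamp is an assumption):

```python
import pandas as pd

def load_with_optimized_dtypes(path, dtypes: dict) -> pd.DataFrame:
    """Read the CSV with the optimized dtype mapping applied up front."""
    return pd.read_csv(path, dtype=dtypes, parse_dates=["timestamp"])
```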


D. Performance Impact

Smaller types aren't just about memory — they're also faster!

TODO 6: Implement benchmark_operation() to compare:

  • Groupby operations
  • Filter operations
  • Sort operations

TODO 7: Implement calculate_savings() to summarize total improvements.

Goal: Measure the performance speedup from optimization.
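One way to structure benchmark_operation(): time the same callable against both the baseline and the optimized frame and keep the best of a few runs to reduce noise. The best-of-N strategy is an assumption, not a lab requirement.

```python
import time
import pandas as pd

def benchmark_operation(func, df: pd.DataFrame, repeats: int = 3) -> float:
    """Best-of-N wall-clock time (seconds) for one DataFrame operation."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(df)
        best = min(best, time.perf_counter() - start)
    return best

# e.g. run the same groupby on the default-typed and optimized frames:
# benchmark_operation(lambda d: d.groupby("category")["price"].mean(), df_baseline)
# benchmark_operation(lambda d: d.groupby("category")["price"].mean(), df_optimized)
```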


E. Reflection & Save Results (15 min)

Write a short reflection and save your metrics.

Goal: Complete results/lab03_metrics.json.
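A small helper for writing the metrics file; the example key is illustrative only, since the lab's notebook defines the actual schema of results/lab03_metrics.json.

```python
import json
from pathlib import Path

def save_metrics(metrics: dict, path: str = "results/lab03_metrics.json") -> None:
    """Write the collected metrics as pretty-printed JSON, creating results/ if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2))

# e.g. save_metrics({"memory_reduction_factor": 5.2})  # illustrative key
```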


📦 What to Submit

Submit exactly these two files:

  1. notebooks/lab03_data_types.ipynb — Your completed notebook.
  2. results/lab03_metrics.json — The JSON file generated by the notebook.

Do NOT submit:

  • The generated data files (CSV)
  • The __pycache__ directories

🚀 Next Steps

After completing this lab:

  1. Check your results/lab03_metrics.json.
  2. Verify you achieved >5x memory reduction.
  3. Submit your work!

Questions? Check the Tips & Reference Guide or ask your instructor.