Lab 03: Data Types

Welcome to the third Big Data laboratory session! In this lab, you'll learn how to dramatically reduce memory usage and improve performance through smart data type choices.

🎯 What You Will Learn

  • Memory Measurement: How to accurately measure DataFrame memory usage
  • Data Range Analysis: How to analyze value ranges to choose optimal types
  • Integer Sizing: When to use uint8, uint16, uint32, etc.
  • Category dtype: How to optimize repeated strings for huge memory savings
  • Performance Impact: How smaller types lead to faster operations

✅ Pre-flight Checklist

Before starting, ensure you have:

  1. Completed Lab 02: You understand profiling and complexity.
  2. Updated your repo: Run git pull to get the latest changes.
  3. Checked out main: Run git checkout main.
  4. Created a local branch: Run git checkout -b <your_branch_name>.
  5. Installed dependencies: Run uv sync.

📝 Lab Steps

Follow along in the notebook notebooks/lab03_data_types.ipynb.

A. Generate the Dataset

We'll generate a synthetic e-commerce dataset with 5 million rows containing:

  • order_id: Unique order identifier (0 to 4,999,999)
  • product_id: Product ID (1-50,000)
  • category: Product category (15 unique values)
  • price: Product price (0.01 - 999.99)
  • quantity: Quantity ordered (1-100)
  • country: Customer country (30 unique values)
  • timestamp: Order timestamp

TODO 1: Implement generate_ecommerce_data() to create the dataset.

Goal: Create data/raw/ecommerce_5m.csv (~500MB).
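A minimal sketch of what generate_ecommerce_data() could look like. The column names and value ranges come from the dataset description above; everything else (the seed, the category/country label format, the one-year timestamp window) is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def generate_ecommerce_data(n_rows: int = 5_000_000, seed: int = 42) -> pd.DataFrame:
    """Synthetic e-commerce orders; columns and ranges follow the lab spec."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "order_id": np.arange(n_rows),                              # 0 .. n_rows-1
        "product_id": rng.integers(1, 50_001, n_rows),              # 1 .. 50,000
        "category": rng.choice([f"cat_{i}" for i in range(15)], n_rows),
        "price": rng.uniform(0.01, 999.99, n_rows).round(2),
        "quantity": rng.integers(1, 101, n_rows),                   # 1 .. 100
        "country": rng.choice([f"country_{i}" for i in range(30)], n_rows),
        # assumed: orders spread over one year of seconds
        "timestamp": pd.Timestamp("2024-01-01")
                     + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, n_rows), unit="s"),
    })

# Then write the CSV the later steps will load:
# generate_ecommerce_data().to_csv("data/raw/ecommerce_5m.csv", index=False)
```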


B. Baseline Measurement

Measure how much memory pandas wastes with default dtypes.

TODO 2: Implement measure_memory() to measure DataFrame memory usage.

What you'll observe:

  • int64 uses 8 bytes for ALL integers (even small ones!)
  • float64 uses 8 bytes for ALL floats
  • object uses ~50+ bytes per string value

Goal: Understand the memory cost of default types.
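The key is pandas' memory_usage(deep=True): without deep=True an object column reports only its 8-byte pointers, hiding the true cost of the Python strings behind them. A minimal sketch of measure_memory() (the MB unit is an assumption):

```python
import pandas as pd

def measure_memory(df: pd.DataFrame) -> pd.Series:
    """Per-column memory in MB. deep=True inspects the actual string
    objects in object columns instead of counting only their pointers."""
    return df.memory_usage(deep=True) / 1024**2
```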


C. Type Analysis & Optimization

This is the core of the lab: analyze your data and choose optimal types.

TODO 3: Implement analyze_column_ranges() to identify min/max/nunique for each column.
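One possible shape for analyze_column_ranges(), assuming a one-row-per-column summary frame (the exact return format is up to you):

```python
import pandas as pd

def analyze_column_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """min/max (numeric columns only) and distinct-value count per column."""
    rows = []
    for col in df.columns:
        s = df[col]
        is_num = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "min": s.min() if is_num else None,
            "max": s.max() if is_num else None,
            "nunique": s.nunique(),
        })
    return pd.DataFrame(rows).set_index("column")
```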

TODO 4: Implement get_optimal_dtypes() to return the optimal dtype mapping:

Column       Range             Optimal Type
-----------  ----------------  --------------------
order_id     0 to 5M           uint32 (max 4.3B)
product_id   1 to 50,000       uint16 (max 65,535)
category     15 unique values  category
price        0.01 to 999.99    float32
quantity     1 to 100          uint8 (max 255)
country      30 unique values  category
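The table above can be expressed directly as a dtype mapping; a sketch of get_optimal_dtypes() (timestamp is omitted because it is parsed separately as a datetime, not cast via dtype):

```python
def get_optimal_dtypes() -> dict:
    """The dtype mapping from the table above."""
    return {
        "order_id": "uint32",    # fits 0..4,999,999 (uint32 max ~4.3B)
        "product_id": "uint16",  # fits 1..50,000 (uint16 max 65,535)
        "category": "category",  # 15 unique values
        "price": "float32",      # halves the 8-byte float64 default
        "quantity": "uint8",     # fits 1..100 (uint8 max 255)
        "country": "category",   # 30 unique values
    }
```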

TODO 5: Implement load_with_optimized_dtypes() to load CSV with optimal types.

Goal: Achieve >5x memory reduction.
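Passing the mapping to read_csv's dtype parameter applies the types at parse time, so the 8-byte defaults are never materialized. A sketch of load_with_optimized_dtypes() (the parse_dates handling of timestamp is an assumption):

```python
import pandas as pd

def load_with_optimized_dtypes(path, dtypes: dict) -> pd.DataFrame:
    """Read the CSV with the optimized dtype mapping applied up front."""
    return pd.read_csv(path, dtype=dtypes, parse_dates=["timestamp"])
```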


D. Performance Impact

Smaller types aren't just about memory — they're also faster!

TODO 6: Implement benchmark_operation() to compare:

  • Groupby operations
  • Filter operations
  • Sort operations

TODO 7: Implement calculate_savings() to summarize total improvements.

Goal: Measure the performance speedup from optimization.
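One way to structure benchmark_operation(): time the same callable against both the baseline and the optimized frame and keep the best of a few runs to reduce noise. The best-of-N strategy is an assumption, not a lab requirement.

```python
import time
import pandas as pd

def benchmark_operation(func, df: pd.DataFrame, repeats: int = 3) -> float:
    """Best-of-N wall-clock time (seconds) for one DataFrame operation."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(df)
        best = min(best, time.perf_counter() - start)
    return best

# e.g. run the same groupby on the default-typed and optimized frames:
# benchmark_operation(lambda d: d.groupby("category")["price"].mean(), df_baseline)
# benchmark_operation(lambda d: d.groupby("category")["price"].mean(), df_optimized)
```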


E. Reflection & Save Results (15 min)

Write a short reflection and save your metrics.

Goal: Complete results/lab03_metrics.json.
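A small helper for writing the metrics file; the example key is illustrative only, since the lab's notebook defines the actual schema of results/lab03_metrics.json.

```python
import json
from pathlib import Path

def save_metrics(metrics: dict, path: str = "results/lab03_metrics.json") -> None:
    """Write the collected metrics as pretty-printed JSON, creating results/ if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2))

# e.g. save_metrics({"memory_reduction_factor": 5.2})  # illustrative key
```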


📦 What to Submit

Submit exactly these two files:

  1. notebooks/lab03_data_types.ipynb — Your completed notebook.
  2. results/lab03_metrics.json — The JSON file generated by the notebook.

Do NOT submit:

  • The generated data files (CSV)
  • The __pycache__ directories

🚀 Next Steps

After completing this lab:

  1. Check your results/lab03_metrics.json.
  2. Verify you achieved >5x memory reduction.
  3. Submit your work!

Questions? Check the Tips & Reference Guide or ask your instructor.