# Lab 03: Data Types
Welcome to the third Big Data laboratory session! In this lab, you'll learn how to dramatically reduce memory usage and improve performance through smart data type choices.
## 📚 Additional Resources
- Tips & Reference Guide - detailed tips, code examples, and cheatsheets for every exercise.
- Lab 02 Instructions - if you need to review complexity and profiling.
## 🎯 What You Will Learn
- Memory Measurement: How to accurately measure DataFrame memory usage
- Data Range Analysis: How to analyze value ranges to choose optimal types
- Integer Sizing: When to use `uint8`, `uint16`, `uint32`, etc.
- Category dtype: How to optimize repeated strings for huge memory savings
- Performance Impact: How smaller types lead to faster operations
## ✅ Pre-flight Checklist
Before starting, ensure you have:
- Completed Lab 02: You understand profiling and complexity.
- Updated your repo: Run `git pull` to get the latest changes.
- Checked out main: Run `git checkout main`.
- Created a local branch: Run `git checkout -b <your_branch_name>`.
- Installed dependencies: Run `uv sync`.
## 📝 Lab Steps
Follow along in the notebook `notebooks/lab03_data_types.ipynb`.
### A. Generate the Dataset
We'll generate a synthetic e-commerce dataset with 5 million rows containing:
- `order_id`: Unique order identifier (0 to 4,999,999)
- `product_id`: Product ID (1-50,000)
- `category`: Product category (15 unique values)
- `price`: Product price (0.01-999.99)
- `quantity`: Quantity ordered (1-100)
- `country`: Customer country (30 unique values)
- `timestamp`: Order timestamp
TODO 1: Implement `generate_ecommerce_data()` to create the dataset.
Goal: Create `data/raw/ecommerce_5m.csv` (~500 MB).
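One possible shape for the generator — a minimal sketch only, assuming the column names and value ranges listed above; the exact distributions and category/country labels are up to you:

```python
import numpy as np
import pandas as pd

def generate_ecommerce_data(n_rows: int = 5_000_000, seed: int = 42) -> pd.DataFrame:
    """Generate a synthetic e-commerce dataset with the columns listed above."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "order_id": np.arange(n_rows),                                  # 0 .. n_rows-1, unique
        "product_id": rng.integers(1, 50_001, size=n_rows),             # 1 .. 50,000
        "category": rng.choice([f"category_{i}" for i in range(15)], size=n_rows),
        "price": rng.uniform(0.01, 999.99, size=n_rows).round(2),       # 0.01 .. 999.99
        "quantity": rng.integers(1, 101, size=n_rows),                  # 1 .. 100
        "country": rng.choice([f"country_{i}" for i in range(30)], size=n_rows),
        "timestamp": pd.Timestamp("2024-01-01")
                     + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, size=n_rows), unit="s"),
    })
```

Writing the result with `df.to_csv("data/raw/ecommerce_5m.csv", index=False)` produces the ~500 MB file the later steps will load.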
### B. Baseline Measurement
Measure how much memory pandas wastes with default dtypes.
TODO 2: Implement `measure_memory()` to measure DataFrame memory usage.
What you'll observe:
- `int64` uses 8 bytes for ALL integers (even small ones!)
- `float64` uses 8 bytes for ALL floats
- `object` uses ~50+ bytes per string value
Goal: Understand the memory cost of default types.
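A minimal sketch of the measurement helper. The key detail is `deep=True`: without it, pandas counts only the 8-byte pointer stored for each `object` cell, not the string it points to, so string columns look far smaller than they are:

```python
import pandas as pd

def measure_memory(df: pd.DataFrame) -> float:
    """Return total DataFrame memory usage in megabytes.

    deep=True makes pandas inspect object columns and count the actual
    string payloads instead of just the per-cell pointers.
    """
    return df.memory_usage(deep=True).sum() / 1024**2
```

Comparing `df.memory_usage(deep=True)` against `df.memory_usage(deep=False)` on a string column is a quick way to see the gap yourself.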
### C. Type Analysis & Optimization
This is the core of the lab: analyze your data and choose optimal types.
TODO 3: Implement `analyze_column_ranges()` to identify min/max/nunique for each column.
TODO 4: Implement `get_optimal_dtypes()` to return the optimal dtype mapping:
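One way to sketch the range analysis — the output format (a summary DataFrame) is an assumption; any structure that exposes min, max, and unique count per column works:

```python
import pandas as pd

def analyze_column_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize min, max, and unique count per column to guide dtype choices."""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "min": s.min() if numeric else None,   # only meaningful for numeric columns
            "max": s.max() if numeric else None,
            "nunique": s.nunique(),                # low cardinality suggests `category`
        })
    return pd.DataFrame(rows)
```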
| Column | Range | Optimal Type |
|---|---|---|
| `order_id` | 0 to 5M | `uint32` (max 4.3B) |
| `product_id` | 1 to 50,000 | `uint16` (max 65,535) |
| `category` | 15 unique | `category` |
| `price` | 0.01 to 999.99 | `float32` |
| `quantity` | 1 to 100 | `uint8` (max 255) |
| `country` | 30 unique | `category` |
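The table above can be turned into code directly. A sketch, with a small helper (`smallest_uint`, our own name, not a pandas function) that picks the narrowest unsigned integer type from `np.iinfo` limits:

```python
import numpy as np

def smallest_uint(max_value: int) -> str:
    """Return the smallest unsigned integer dtype that can hold max_value."""
    for dtype in ("uint8", "uint16", "uint32", "uint64"):
        if max_value <= np.iinfo(dtype).max:
            return dtype
    raise ValueError(f"{max_value} does not fit in any unsigned integer type")

def get_optimal_dtypes() -> dict:
    """Dtype mapping matching the table: integers sized to their range,
    low-cardinality strings as category, price as float32."""
    return {
        "order_id": smallest_uint(4_999_999),   # uint32
        "product_id": smallest_uint(50_000),    # uint16
        "category": "category",
        "price": "float32",
        "quantity": smallest_uint(100),         # uint8
        "country": "category",
    }
```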
TODO 5: Implement `load_with_optimized_dtypes()` to load the CSV with optimal types.
Goal: Achieve >5x memory reduction.
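A minimal sketch of the optimized loader. Passing the mapping via `read_csv`'s `dtype=` parameter means pandas parses straight into the small types and never materializes the wasteful `int64`/`object` defaults at all:

```python
import pandas as pd

def load_with_optimized_dtypes(path) -> pd.DataFrame:
    """Load the e-commerce CSV with the dtype mapping chosen in TODO 4."""
    dtypes = {
        "order_id": "uint32",
        "product_id": "uint16",
        "category": "category",
        "price": "float32",
        "quantity": "uint8",
        "country": "category",
    }
    # parse_dates turns the timestamp column into datetime64 instead of object
    return pd.read_csv(path, dtype=dtypes, parse_dates=["timestamp"])
```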
### D. Performance Impact
Smaller types aren't just about memory — they're also faster!
TODO 6: Implement `benchmark_operation()` to compare:
- Groupby operations
- Filter operations
- Sort operations
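One possible shape for the benchmark helper — a sketch assuming it takes a zero-argument callable and returns the best-of-n wall time (taking the minimum filters out noise from other processes):

```python
import time

def benchmark_operation(func, n_runs: int = 5) -> float:
    """Time a callable, returning the best wall-clock time in seconds over n_runs."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Usage against both DataFrames, e.g. `benchmark_operation(lambda: df.groupby("category")["price"].mean())`, lets you compare groupby, filter, and sort timings before and after optimization.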
TODO 7: Implement `calculate_savings()` to summarize the total improvements.
Goal: Measure the performance speedup from optimization.
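A sketch of the summary helper — the signature and key names are assumptions; report whatever ratios your notebook asks for:

```python
def calculate_savings(baseline_mb: float, optimized_mb: float,
                      baseline_s: float, optimized_s: float) -> dict:
    """Summarize memory and speed improvements as ratios and absolute savings."""
    return {
        "memory_reduction_x": round(baseline_mb / optimized_mb, 2),  # e.g. 5.0 means 5x smaller
        "memory_saved_mb": round(baseline_mb - optimized_mb, 2),
        "speedup_x": round(baseline_s / optimized_s, 2),
    }
```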
### E. Reflection & Save Results (15 min)
Write a short reflection and save your metrics.
Goal: Complete `results/lab03_metrics.json`.
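Saving the metrics is a one-liner with the standard library. A sketch — the keys shown here are placeholders; use the names and values your notebook actually computed:

```python
import json
from pathlib import Path

# Placeholder metrics -- substitute the values measured in the steps above.
metrics = {
    "baseline_memory_mb": 1000.0,
    "optimized_memory_mb": 180.0,
    "memory_reduction_x": 5.6,
}

out = Path("results/lab03_metrics.json")
out.parent.mkdir(parents=True, exist_ok=True)   # create results/ if it's missing
out.write_text(json.dumps(metrics, indent=2))
```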
## 📦 What to Submit
Submit exactly these two files:
- `notebooks/lab03_data_types.ipynb` — your completed notebook.
- `results/lab03_metrics.json` — the JSON file generated by the notebook.
Do NOT submit:
- The generated data files (CSV)
- The `__pycache__` directories
## 🚀 Next Steps
After completing this lab:
- Check your `results/lab03_metrics.json`.
- Verify you achieved >5x memory reduction.
- Submit your work!
Questions? Check the Tips & Reference Guide or ask your instructor.