Lab 03: Tips & Quick Reference
This guide provides detailed tips, code examples, and cheatsheets to help you complete Lab 03.
📚 General Tips
- Start with analysis: Before optimizing, understand your data's actual value ranges.
- Use `memory_usage(deep=True)`: the `deep=True` flag is essential for accurate measurement of object-dtype columns.
- Profile before and after: always measure the impact of your changes.
🔑 Essential Functions Cheat Sheet
Memory Measurement
# Total memory usage
df.memory_usage(deep=True).sum() / 1e6 # MB
# Per-column breakdown
df.memory_usage(deep=True)
# Memory of a specific column
df['column'].memory_usage(deep=True) / 1e6 # MB
Data Type Information
# View all dtypes
df.dtypes
# Check unique values (for category decisions)
df['column'].nunique()
# Check value ranges (for int sizing)
df['column'].min(), df['column'].max()
Timing Code
import time
start = time.perf_counter()
# ... operation ...
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f} seconds")
Section A: Dataset Generation
What you need to do
Generate a 5 million row e-commerce dataset.
Key columns and generators
| Column | Type | Generator |
|---|---|---|
| order_id | int | np.arange(n) |
| product_id | int | np.random.randint(1, 50001, size=n) |
| category | str | np.random.choice([...], size=n) |
| price | float | np.random.uniform(0.01, 999.99, size=n) |
| quantity | int | np.random.randint(1, 101, size=n) |
| country | str | np.random.choice([...], size=n) |
| timestamp | datetime | pd.date_range(...) |
Sample categories
categories = [
'Electronics', 'Clothing', 'Home', 'Books', 'Toys',
'Sports', 'Beauty', 'Food', 'Garden', 'Automotive',
'Health', 'Office', 'Pet', 'Music', 'Games'
]
Sample countries
countries = [
'Spain', 'France', 'Germany', 'Italy', 'UK',
'Portugal', 'Netherlands', 'Belgium', 'Poland', 'Sweden',
'Norway', 'Denmark', 'Finland', 'Austria', 'Switzerland',
'Ireland', 'Greece', 'Czech', 'Romania', 'Hungary',
'USA', 'Canada', 'Mexico', 'Brazil', 'Argentina',
'Japan', 'China', 'Australia', 'India', 'Korea'
]
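The table and lists above combine into a short generation script. Here is a minimal sketch, assuming the `categories` and `countries` lists defined above; the seed, start date, and one-row-per-second timestamps are illustrative choices, not lab requirements:

```python
import numpy as np
import pandas as pd

n = 5_000_000
np.random.seed(42)  # seed is an assumption; use the lab's value if one is given

df = pd.DataFrame({
    'order_id': np.arange(n),
    'product_id': np.random.randint(1, 50001, size=n),
    'category': np.random.choice(categories, size=n),
    'price': np.random.uniform(0.01, 999.99, size=n),
    'quantity': np.random.randint(1, 101, size=n),
    'country': np.random.choice(countries, size=n),
    # Start date and frequency are assumptions -- adjust to the lab's spec
    'timestamp': pd.date_range('2024-01-01', periods=n, freq='s'),
})
```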
Section B: Baseline Measurement
Measuring memory accurately
Key steps:
- Get memory usage: use `df.memory_usage(deep=True)` to get a Series with memory per column.
  - Returns a Series where index = column names, values = bytes.
  - Don't forget `deep=True`!
- Calculate total memory: sum all values and convert bytes to MB.
  - Bytes to MB: divide by `1e6`.
- Build the result dictionary: create a dict with two keys:
  - `'total_mb'`: the total memory in MB.
  - `'columns'`: a nested dict with info for each column.
- For each column, store:
  - `'dtype'`: convert the dtype to a string with `str(df[col].dtype)`.
  - `'memory_mb'`: get the memory from the Series and convert to MB.
  - `'nunique'`: use `df[col].nunique()`.

Hint: you can iterate over columns with `for col in df.columns:`. A sketch of these steps appears below.
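Putting the steps together, a minimal sketch (the function name `measure_memory` is an assumption; follow whatever signature the lab specifies):

```python
import pandas as pd

def measure_memory(df: pd.DataFrame) -> dict:
    """Report total and per-column memory usage in MB."""
    mem = df.memory_usage(deep=True)  # Series: index = column names, values = bytes
    report = {'total_mb': round(mem.sum() / 1e6, 2), 'columns': {}}
    for col in df.columns:
        report['columns'][col] = {
            'dtype': str(df[col].dtype),
            'memory_mb': round(mem[col] / 1e6, 2),
            'nunique': int(df[col].nunique()),
        }
    return report
```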
Expected output structure
# Example return value:
{
'total_mb': 2456.78,
'columns': {
'order_id': {
'dtype': 'int64',
'memory_mb': 38.15,
'nunique': 5000000
},
'product_id': {
'dtype': 'int64',
'memory_mb': 38.15,
'nunique': 50000
},
'category': {
'dtype': 'object',
'memory_mb': 856.42,
'nunique': 15
},
# ... more columns
}
}
Why deep=True matters
Without deep=True, pandas only counts the pointer memory for object columns, not the actual string data!
# Wrong (underestimates strings)
df.memory_usage()
# Correct (includes string content)
df.memory_usage(deep=True)
Section C: Type Analysis & Optimization
Type Selection Guide
Is it numeric?
├─ NO → Is it a string?
│ ├─ YES → Are <50% values unique?
│ │ ├─ YES → category
│ │ └─ NO → object (or string dtype)
│ └─ Is it a date? → datetime64
└─ YES → Is it an integer?
├─ YES → What's the range?
│ ├─ 0 to 255 → uint8
│ ├─ -128 to 127 → int8
│ ├─ 0 to 65,535 → uint16
│ ├─ -32,768 to 32,767 → int16
│ ├─ 0 to 4.3B → uint32
│ ├─ -2.1B to 2.1B → int32
│ └─ Larger → int64
└─ NO (decimal) → Precision needed?
├─ Low (2-3 decimals) → float32
└─ High (scientific) → float64
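Much of this decision tree can be automated with `pd.to_numeric` and its `downcast` parameter. A minimal sketch for the numeric branch (the function name is illustrative):

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink each numeric column to the smallest dtype that fits its values."""
    out = df.copy()
    for col in out.select_dtypes(include='number').columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # Non-negative integers can use unsigned types for extra headroom
            kind = 'unsigned' if s.min() >= 0 else 'integer'
            out[col] = pd.to_numeric(s, downcast=kind)
        else:
            # Floats downcast to float32 when the values allow it
            out[col] = pd.to_numeric(s, downcast='float')
    return out
```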
Integer Types Reference
| Type | Min | Max | Bytes |
|---|---|---|---|
| int8 | -128 | 127 | 1 |
| uint8 | 0 | 255 | 1 |
| int16 | -32,768 | 32,767 | 2 |
| uint16 | 0 | 65,535 | 2 |
| int32 | -2.1B | 2.1B | 4 |
| uint32 | 0 | 4.3B | 4 |
| int64 | -9.2Q | 9.2Q | 8 |
Analyzing column ranges
Key steps for analyze_column_ranges():
- Create an empty result dictionary to store the analysis for each column.
- Loop through each column in the DataFrame.
- Check the column type:
  - Use `np.issubdtype(col_type, np.number)` to check if it is numeric.
  - Check if `col_type == 'object'` for strings.
  - Otherwise it is likely a datetime or other special type.
- For numeric columns, store:
  - `'min'`: use `df[col].min()`.
  - `'max'`: use `df[col].max()`.
  - `'nunique'`: use `int(df[col].nunique())`.
- For string (object) columns, store:
  - `'nunique'`: number of unique values.
  - `'max_len'`: use `int(df[col].str.len().max())`.
  - `'sample'`: the first 5 unique values as a list.
- For other types (datetime, etc.), store:
  - `'dtype'`: the string representation of the dtype.
  - `'nunique'`: number of unique values.

A sketch of these steps appears below.
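A minimal sketch of `analyze_column_ranges()` along these lines (details may differ from the official solution):

```python
import numpy as np
import pandas as pd

def analyze_column_ranges(df: pd.DataFrame) -> dict:
    """Collect range/uniqueness info needed to pick smaller dtypes."""
    result = {}
    for col in df.columns:
        col_type = df[col].dtype
        if np.issubdtype(col_type, np.number):
            result[col] = {
                'min': df[col].min(),
                'max': df[col].max(),
                'nunique': int(df[col].nunique()),
            }
        elif col_type == 'object':
            result[col] = {
                'nunique': int(df[col].nunique()),
                'max_len': int(df[col].str.len().max()),
                'sample': df[col].unique()[:5].tolist(),
            }
        else:  # datetime and other special types
            result[col] = {
                'dtype': str(col_type),
                'nunique': int(df[col].nunique()),
            }
    return result
```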
Expected output structure
# Example return value:
{
'order_id': {
'min': 0,
'max': 4999999,
'nunique': 5000000
},
'product_id': {
'min': 1,
'max': 50000,
'nunique': 49987
},
'category': {
'nunique': 15,
'max_len': 11,
'sample': ['Electronics', 'Clothing', 'Home', 'Books', 'Toys']
},
'price': {
'min': 0.01,
'max': 999.99,
'nunique': 99989
},
# ... more columns
}
Loading with optimized dtypes
How to specify dtypes when reading CSV:
Use the dtype parameter in pd.read_csv() to specify types for each column:
df = pd.read_csv('file.csv', dtype={
'column_name': 'dtype_string',
'another_column': 'dtype_string',
# ... more columns
})
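For this lab's columns the mapping might look like the sketch below. The exact dtypes are assumptions to be checked against your own range analysis, and the file name is a placeholder; note that the timestamp is handled by `parse_dates`, not `dtype` (see the pitfalls table below):

```python
import pandas as pd

df = pd.read_csv(
    'ecommerce.csv',             # placeholder file name
    dtype={
        'order_id': 'uint32',    # 0..4,999,999 fits in uint32
        'product_id': 'uint16',  # 1..50,000 fits in uint16
        'category': 'category',  # 15 unique values
        'quantity': 'uint8',     # 1..100 fits in uint8
        'price': 'float32',      # 2-decimal prices survive float32
        'country': 'category',   # 30 unique values
    },
    parse_dates=['timestamp'],   # datetimes are parsed, not declared in dtype
)
```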
Section D: Performance Impact
Benchmarking operations
Key steps for benchmark_operation():
- Use an if-elif structure to handle the different operation types:
  - `'groupby_sum'`: group by category and sum prices.
  - `'filter'`: filter rows where country equals a specific value.
  - `'sort'`: sort by the price column.
- For each operation type:
  - Time the baseline DataFrame:
    - Start the timer with `start = time.perf_counter()`.
    - Execute the operation (store the result in `_` to discard it).
    - Calculate the elapsed time: `time.perf_counter() - start`.
  - Time the optimized DataFrame (same process).
- Return a dictionary with:
  - `'baseline_sec'`: time for the baseline (rounded to 4 decimals).
  - `'optimized_sec'`: time for the optimized version (rounded to 4 decimals).
  - `'speedup'`: `baseline_sec / optimized_sec` (rounded to 2 decimals).
Detailed steps for each operation:
For `'groupby_sum'`:
- Start timer: `start = time.perf_counter()`
- Execute on baseline: `_ = df_baseline.groupby('category')['price'].sum()`
- Calculate baseline time: `baseline_sec = time.perf_counter() - start`
- Start new timer: `start = time.perf_counter()`
- Execute on optimized: `_ = df_optimized.groupby('category')['price'].sum()`
- Calculate optimized time: `optimized_sec = time.perf_counter() - start`

For `'filter'`, follow the same timing pattern with:
- Baseline: `_ = df_baseline[df_baseline['country'] == 'Spain']`
- Optimized: `_ = df_optimized[df_optimized['country'] == 'Spain']`

For `'sort'`, again the same pattern with:
- Baseline: `_ = df_baseline.sort_values('price')`
- Optimized: `_ = df_optimized.sort_values('price')`

Then, for all operations, calculate the speedup and return the dictionary; a complete sketch follows.
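Here is a sketch of `benchmark_operation()`. It folds the repeated timing steps into an inner helper instead of writing out three full if-elif branches; that refactoring is a stylistic choice, not a lab requirement:

```python
import time
import pandas as pd

def benchmark_operation(df_baseline: pd.DataFrame,
                        df_optimized: pd.DataFrame,
                        operation: str) -> dict:
    """Time one operation on both DataFrames and report the speedup."""
    def run(df: pd.DataFrame) -> float:
        start = time.perf_counter()
        if operation == 'groupby_sum':
            _ = df.groupby('category')['price'].sum()
        elif operation == 'filter':
            _ = df[df['country'] == 'Spain']
        elif operation == 'sort':
            _ = df.sort_values('price')
        else:
            raise ValueError(f'Unknown operation: {operation}')
        return time.perf_counter() - start

    baseline_sec = run(df_baseline)
    optimized_sec = run(df_optimized)
    return {
        'baseline_sec': round(baseline_sec, 4),
        'optimized_sec': round(optimized_sec, 4),
        'speedup': round(baseline_sec / optimized_sec, 2),
    }
```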
Expected output structure
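The structure mirrors the return dictionary described above; the timings here are illustrative and machine-dependent:

```python
# Example return value (illustrative numbers):
{
    'baseline_sec': 0.4812,
    'optimized_sec': 0.1985,
    'speedup': 2.42
}
```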
Why category is faster
The category dtype stores:
- A dictionary of unique values: {0: 'Electronics', 1: 'Clothing', ...}
- Integer codes for each row: [0, 1, 0, 2, 1, ...]
Operations like groupby and filter use integer comparisons instead of string comparisons!
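You can inspect both pieces directly through the `.cat` accessor; a quick illustration:

```python
import pandas as pd

s = pd.Series(['Electronics', 'Clothing', 'Electronics', 'Home'], dtype='category')
print(s.cat.categories)      # Index(['Clothing', 'Electronics', 'Home'], dtype='object')
print(s.cat.codes.tolist())  # [1, 0, 1, 2] -- one small integer per row
```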
Calculating total savings
Key steps for calculate_savings():
- Calculate memory saved:
  - Subtract optimized memory from baseline memory.
  - Access total memory with `baseline_memory['total_mb']` and `optimized_memory['total_mb']`.
- Calculate the memory reduction factor:
  - Divide baseline memory by optimized memory.
  - This tells you "how many times smaller" the optimized version is.
- Calculate the average speedup:
  - Extract the `'speedup'` value from each benchmark result in the list.
  - Use a list comprehension: `[r['speedup'] for r in benchmark_results]`.
  - Calculate the average: `sum(speedups) / len(speedups)`.
- Return a dictionary with:
  - `'memory_saved_mb'`: how many MB were saved (rounded to 2 decimals).
  - `'memory_reduction_factor'`: memory reduction factor (rounded to 2 decimals).
  - `'avg_speedup'`: average speedup across all operations (rounded to 2 decimals).

A sketch of these steps appears below.
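A minimal sketch of `calculate_savings()`:

```python
def calculate_savings(baseline_memory: dict, optimized_memory: dict,
                      benchmark_results: list) -> dict:
    """Combine the memory reports and benchmark results into one summary."""
    baseline_mb = baseline_memory['total_mb']
    optimized_mb = optimized_memory['total_mb']
    speedups = [r['speedup'] for r in benchmark_results]
    return {
        'memory_saved_mb': round(baseline_mb - optimized_mb, 2),
        'memory_reduction_factor': round(baseline_mb / optimized_mb, 2),
        'avg_speedup': round(sum(speedups) / len(speedups), 2),
    }
```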
Expected output structure
# Example return value:
{
'memory_saved_mb': 2056.34,
'memory_reduction_factor': 6.25,
'avg_speedup': 2.42
}
📊 Expected Results
When you complete the lab, you should see something like:
Memory Analysis:
Baseline memory: ~2,500 MB
Optimized memory: ~400 MB
Reduction: 6.3x
Performance:
Groupby speedup: 2-3x
Filter speedup: 2-4x
Sort speedup: 1.5-2x
⚠️ Common Pitfalls
| Mistake | Fix |
|---|---|
| Forgetting `deep=True` in `memory_usage` | Always use `df.memory_usage(deep=True)` |
| Using int64 for small ranges | Check `.min()` and `.max()`, use the smallest int that fits |
| Not using category for repeated strings | If `nunique() / len(df) < 0.5`, use category |
| Specifying a dtype for timestamp | Use `parse_dates=['timestamp']` separately |
| Using signed ints when values are ≥ 0 | Use unsigned types (uint8, uint16, etc.) |
📦 Files to Submit
- `notebooks/lab03_data_types.ipynb` (with all cells executed)
- `results/lab03_metrics.json` (generated by the notebook)
Do NOT submit:
- Generated data files
- `__pycache__` directories
Good luck! 🎉