Lab 04: Efficient Formats and Vectorization
Welcome to the fourth Big Data laboratory session! In this lab, you'll compare storage formats (CSV, Parquet, Feather) and master vectorization to build fast data pipelines.
Additional Resources
- Tips & Reference Guide - detailed tips, code examples, and cheatsheets.
What You Will Learn
- Format Comparison: Measure size and speed trade-offs between CSV, Parquet (Snappy, Zstd), and Feather
- Column Pruning & Predicate Pushdown: Read only what you need from Parquet files
- Vectorization: Replace slow Python loops with fast NumPy/Pandas operations (100-200x speedup)
- Broadcasting: Apply operations across arrays without explicit loops
- Pipeline Optimization: Combine efficient formats + vectorization for maximum performance
Pre-flight Checklist
Before starting, ensure you have:
- Completed Lab 03: You understand data types and storage formats.
- Updated your repo: Run git pull to get the latest changes.
- Checked out main: Run git checkout main.
- Created a local branch: Run git checkout -b <your_branch_name>.
- Installed dependencies: Run uv sync.
Lab Steps
Follow along in the notebook notebooks/lab04_formats_vectorization.ipynb.
A. Generate Sales Dataset
Create a realistic sales dataset with 5 million rows:
id,fecha,categoria,producto,precio,cantidad,ciudad
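A minimal generation sketch for a dataset of this shape. The value pools (categories, products, cities) are illustrative assumptions; your notebook's generator may use different ones.

import numpy as np
import pandas as pd

N = 5_000_000
rng = np.random.default_rng(42)

# Illustrative value pools; the notebook's actual generator may differ
df = pd.DataFrame({
    'id': np.arange(N),
    'fecha': pd.Timestamp('2024-01-01')
             + pd.to_timedelta(rng.integers(0, 365, N), unit='D'),
    'categoria': rng.choice(['Electronica', 'Ropa', 'Hogar'], N),
    'producto': rng.choice(['P001', 'P002', 'P003'], N),
    'precio': rng.uniform(1, 500, N).round(2),
    'cantidad': rng.integers(1, 10, N),
    'ciudad': rng.choice(['Madrid', 'Barcelona', 'Valencia'], N),
})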
B. Exercise 1: CSV vs Parquet vs Feather (25 min)
You will save the dataset in multiple formats and benchmark them.
Part 1A: Save in All Formats
Save the DataFrame as CSV, Parquet (Snappy, Zstd, None compression), and Feather. Measure write time and file size.
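A sketch of the measurement pattern, assuming df is the DataFrame from step A; the helper name save_and_measure is our own, not part of any library.

import os
import time

def save_and_measure(write_fn, path):
    # Time one write call and report the resulting file size in MB
    start = time.perf_counter()
    write_fn(path)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(path) / 1024**2

results = {
    'csv': save_and_measure(lambda p: df.to_csv(p, index=False),
                            'ventas.csv'),
    'parquet_snappy': save_and_measure(
        lambda p: df.to_parquet(p, compression='snappy'),
        'ventas_snappy.parquet'),
    'parquet_zstd': save_and_measure(
        lambda p: df.to_parquet(p, compression='zstd'),
        'ventas_zstd.parquet'),
    'parquet_none': save_and_measure(
        lambda p: df.to_parquet(p, compression=None),
        'ventas_none.parquet'),
    'feather': save_and_measure(lambda p: df.to_feather(p),
                                'ventas.feather'),
}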
Part 1B: Read Benchmarks
Benchmark three scenarios:
- Full read: Read the entire file
- Selective read: Read only 2 columns (precio, cantidad)
- Filtered read: Read only rows where categoria == 'Electronica' (Parquet predicate pushdown, sketched below)
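A sketch of the three reads against the Snappy Parquet file written in Part 1A:

import pandas as pd

# Full read: every column, every row
df_full = pd.read_parquet('ventas_snappy.parquet')

# Selective read: column pruning loads only the two requested columns
df_cols = pd.read_parquet('ventas_snappy.parquet',
                          columns=['precio', 'cantidad'])

# Filtered read: predicate pushdown lets the reader skip row groups whose
# statistics rule out a match
df_filt = pd.read_parquet('ventas_snappy.parquet',
                          filters=[('categoria', '==', 'Electronica')])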
Part 1C: Summary Table
Create a summary table with size, write time, and read time for each format.
Goal: Understand when to use each format and the impact of compression.
C. Exercise 2: Rewrite Loops to Vectorized (25 min)
You will rewrite slow loop-based code using vectorized operations.
Part 2A: Distance Calculation
Convert a loop-based Euclidean distance calculation to NumPy broadcasting.
import math

# Original (slow): one Python-level sqrt call per point pair
def calculate_distances_slow(points_a, points_b):
    distances = []
    for i in range(len(points_a)):
        dist = math.sqrt(
            (points_a[i][0] - points_b[i][0])**2 +
            (points_a[i][1] - points_b[i][1])**2
        )
        distances.append(dist)
    return distances
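A vectorized sketch, assuming points_a and points_b are NumPy arrays of shape (n, 2):

import numpy as np

# Vectorized (fast): subtract whole arrays at once, square, sum along the
# coordinate axis, and take the square root of every element in one call
def calculate_distances_fast(points_a, points_b):
    diff = points_a - points_b
    return np.sqrt((diff ** 2).sum(axis=1))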
Part 2B: Age Classification
Replace conditional loop with np.select() or pd.cut().
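A sketch of both options, with illustrative age bins and labels (your notebook defines the actual boundaries):

import numpy as np
import pandas as pd

ages = np.array([5, 17, 34, 70])

# np.select evaluates every condition on the whole array and picks the
# first matching label per element
labels = np.select(
    [ages < 13, ages < 18, ages < 65],
    ['child', 'teen', 'adult'],
    default='senior',
)

# pd.cut bins the same values into labeled intervals in one call
labels_cut = pd.cut(ages, bins=[0, 12, 17, 64, 120],
                    labels=['child', 'teen', 'adult', 'senior'])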
Part 2C: Column Normalization
Replace nested loop with broadcasting: (data - mean) / std.
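A sketch: with data as a 2-D array, the column means and standard deviations broadcast across every row, so no loop is needed.

import numpy as np

data = np.random.default_rng(0).normal(size=(1000, 5))

# mean/std along axis=0 yield one value per column; broadcasting stretches
# those shape-(5,) vectors across all 1000 rows in a single expression
normalized = (data - data.mean(axis=0)) / data.std(axis=0)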
Part 2D: Score Calculation with Clipping
Replace loop with vectorized operations and np.clip().
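A sketch with a hypothetical scoring rule for illustration; the notebook defines the actual formula.

import numpy as np

raw = np.random.default_rng(1).uniform(-20, 140, 1_000_000)

# Hypothetical score: scale and shift, then clamp the whole array
# to [0, 100] with a single np.clip call
scores = np.clip(raw * 0.9 + 5, 0, 100)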
Goal: Achieve >50x speedup on each function.
D. Exercise 3: Comprehensive Benchmark (20 min)
You will quantify the performance impact of vectorization across 5 scenarios.
Benchmarks:
- Sum: loop vs .sum() (timed as sketched below)
- Element-wise multiply: loop vs operator *
- Filter + transform: loop vs .loc[]
- .apply() with lambda vs vectorized
- .apply() with complex function vs NumPy equivalent
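One way to time each pair, shown here for the sum scenario; the timed helper is our own name, and the same pattern applies to the other scenarios.

import time

import numpy as np

arr = np.random.default_rng(2).uniform(size=1_000_000)

def timed(fn):
    # Wall-clock one call; coarse, but enough to see order-of-magnitude gaps
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

loop_s = timed(lambda: sum(x for x in arr))  # Python-level iteration
vec_s = timed(lambda: arr.sum())             # single NumPy call
print(f'speedup: {loop_s / vec_s:.0f}x')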
Goal: Document speedups for different operation types.
E. Exercise 4: Integrated Pipeline (20 min)
You will build and compare two data pipelines:
Naive pipeline: Read CSV + process with Python loops
# Parse the entire CSV (every column, every row), then iterate row by row;
# each .iloc[i] lookup is a slow Python-level operation
df = pd.read_csv('ventas.csv')
totals = []
for i in range(len(df)):
    if df.iloc[i]['categoria'] == 'Electronica':
        totals.append(df.iloc[i]['precio'] * df.iloc[i]['cantidad'])
Optimized pipeline: Read Parquet (selective + filtered) + vectorized operations
# Column pruning (columns=) and predicate pushdown (filters=) mean only the
# needed data is ever read; the multiply is a single vectorized operation
df = pd.read_parquet('ventas_snappy.parquet',
                     columns=['categoria', 'precio', 'cantidad'],
                     filters=[('categoria', '==', 'Electronica')])
df['total'] = df['precio'] * df['cantidad']
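The filters argument is pushed down to the Parquet reader, which uses row-group statistics to skip chunks that cannot contain matching rows.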
Goal: Measure total speedup from combining format + vectorization.
What to Submit
Submit exactly these two files:
- notebooks/lab04_formats_vectorization.ipynb - Your completed notebook.
- results/lab04_metrics.json - Generated metrics file.
Do NOT submit:
- Large generated data files
- __pycache__ directories
Key Takeaways
After completing this lab, remember:
- Parquet is 3-10x smaller and faster than CSV for analytical workloads
- Compression trade-offs: Snappy = fast, Zstd = best ratio, None = fastest write
- Feather is best for intermediate data between pipeline steps
- Vectorization is 100-200x faster than Python loops
- Never use .apply() with simple lambdas: always look for a vectorized alternative
- Combining optimizations (format + vectorization) yields massive speedups
Questions? Check the Tips & Reference Guide or ask your instructor.