# Lab 02: Complexity
Welcome to the second Big Data laboratory session! In this lab, we will leave the safety of small datasets and enter the world where algorithmic complexity matters.
## 📚 Additional Resources
- Tips & Reference Guide - detailed tips, code examples, and cheatsheets for every TODO.
- Lab 01 Instructions - if you need to review setup or I/O basics.
## 🎯 What You Will Learn
- Time Complexity: Why O(N²) fails at scale while O(N log N) succeeds.
- Space Complexity: Understanding memory trade-offs in algorithm design.
- Profiling Tools: How to use `cProfile`, `py-spy` flamegraphs, and `line_profiler` to find bottlenecks.
- Optimization: Converting O(N²) algorithms to O(N) for massive speedups.
## 📋 Implementation Overview
You will implement these 7 functions:
| TODO | Function | Purpose |
|---|---|---|
| 1 | `generate_user_logs()` | Generate 1M row dataset |
| 2 | `benchmark_search()` | Compare List vs Set |
| 3 | `bubble_sort()` | Classic O(N²) algorithm |
| 4 | `find_duplicates_set()` | Hash-based O(N) approach |
| 5 | `find_duplicates_inplace()` | Sort-based O(N log N) approach |
| 6 | `profile_function()` | Using cProfile |
| 7 | `find_duplicates_fast()` | The 10x optimization challenge |
Provided for you:

- ✅ `find_duplicates_slow()` – the deliberately terrible O(N²) version to profile
- ✅ Flamegraph and line_profiler demonstrations
## ✅ Pre-flight Checklist
Before starting, ensure you have:
- Completed Lab 01: You understand basic I/O and have your environment set up.
- Updated your repo: Run `git pull` to get the latest changes.
- Created a branch: Run `git checkout -b lab02-complexity`.
- Installed dependencies: Run `uv sync --group lab02` to install profiling tools.
## 🗺️ Learning Path

```
Dataset → Time Complexity → Space Complexity → Profiling → Optimization
(TODO 1)    (TODO 2-3)         (TODO 4-5)      (TODO 6)     (TODO 7)
   ↓            ↓                  ↓               ↓            ↓
1M rows    O(N) vs O(1)       O(N) space vs   Find the    10x speedup
  CSV    O(N²) vs O(N log N)    O(1) space    bottleneck   with O(N)
```
## 📝 Lab Exercises
Follow along in the notebook `notebooks/lab02_complexity_dataflow.ipynb`.
### Exercise 1: Dataset Generation
#### TODO 1: `generate_user_logs()`
We need a dataset large enough to expose inefficient code. Implement this function to create:
- Rows: 1,000,000
- Columns: `user_id`, `session_id`, `action`, `timestamp`, `value`
- Features: Duplicate user IDs (essential for later exercises)
Goal: Save this as `data/raw/user_logs_1m.csv`.
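A minimal sketch of the generator using only the standard library. The specific value ranges (50,000 distinct user IDs to guarantee duplicates, the action names, the timestamp base) are illustrative assumptions, not lab requirements:

```python
import csv
import random

def generate_user_logs(path="data/raw/user_logs_1m.csv", n_rows=1_000_000):
    """Write n_rows of synthetic log data as CSV.

    Drawing user_id from a range much smaller than n_rows
    guarantees the duplicate IDs the later exercises need.
    """
    actions = ["login", "click", "purchase", "logout"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "session_id", "action", "timestamp", "value"])
        for i in range(n_rows):
            writer.writerow([
                random.randint(1, 50_000),        # only 50k distinct users -> duplicates
                f"s{i}",                          # unique session id
                random.choice(actions),
                1_700_000_000 + i,                # increasing Unix-style timestamp
                round(random.random() * 100, 2),  # arbitrary numeric payload
            ])
```

Writing row by row keeps memory flat; building the whole table in memory first would itself be an O(N)-space decision worth noticing in this lab.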
### Exercise 2: Time Complexity – Search & Sort
Compare the performance impact of different data structures and algorithms.
#### TODO 2: `benchmark_search()` – O(N) vs O(1)
Compare finding an item in a Python List versus a Set:
- Create a list and set of 1M numbers.
- Search for 1,000 random keys in both.
- Calculate the speedup.
What to expect: The Set should be ~1000x faster.
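The three steps above can be sketched as follows (the exact return value and timing approach are one possible design, not the required interface):

```python
import random
import time

def benchmark_search(n=1_000_000, n_keys=1_000):
    """Time membership tests on a list (O(N) scan) vs a set (O(1) hash lookup)."""
    data_list = list(range(n))
    data_set = set(data_list)
    keys = [random.randrange(n) for _ in range(n_keys)]

    t0 = time.perf_counter()
    for k in keys:
        k in data_list            # linear scan through the list
    list_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    for k in keys:
        k in data_set             # single hash lookup
    set_time = time.perf_counter() - t0

    return list_time / set_time   # speedup factor of set over list
```

Use `time.perf_counter()` rather than `time.time()` for benchmarks: it is monotonic and has higher resolution.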
#### TODO 3: `bubble_sort()` – O(N²) vs O(N log N)
Implement the classic bubble sort algorithm:
- Compare adjacent elements, swap if out of order.
- Repeat until sorted.
- Compare against Python's built-in `sorted()` at N = 100, 1000, 5000.
What to expect: Python sort should be 100x+ faster at N=5000.
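A reference implementation of the two rules above (returning a new list and adding an early-exit flag are design choices on my part, not part of the spec):

```python
def bubble_sort(items):
    """Classic O(N²) bubble sort; returns a new sorted list."""
    result = list(items)          # copy so the input is untouched
    n = len(result)
    for i in range(n):
        swapped = False
        # After pass i, the last i elements are already in place.
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:           # no swaps means the list is sorted; stop early
            break
    return result
```

Even with the early exit, a shuffled input still forces ~N²/2 comparisons, which is why Timsort (`sorted()`, O(N log N)) pulls away so dramatically by N = 5000.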
### Exercise 3: Space Complexity – Memory Trade-offs
Not just time – sometimes memory is the bottleneck.
#### TODO 4: `find_duplicates_set()`
- O(N) time, O(N) space – uses a set to track seen items
- Fast, but uses memory proportional to input size
#### TODO 5: `find_duplicates_inplace()`
- O(N log N) time, O(1) extra space – sorts in-place
- Memory efficient, but modifies the input and is slower
Goal: Understand the time-space trade-off.
| Approach | Time | Space | Best When... |
|---|---|---|---|
| Set-based | O(N) | O(N) | Memory is plentiful, speed is critical |
| Sort in-place | O(N log N) | O(1) | Memory is limited, data can be modified |
### Exercise 4: Profiling Bottlenecks
We provide a deliberately terrible O(N²) function: `find_duplicates_slow()`.
#### TODO 6: `profile_function()`
- Implement using Python's `cProfile`.
- Run on `find_duplicates_slow()` with a small sample (2-5k rows).
- Analyze the output to identify the slowest functions.
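A minimal wrapper along these lines satisfies the steps above (sorting by cumulative time and printing the top 10 entries are my choices; the exact signature is up to you):

```python
import cProfile
import io
import pstats

def profile_function(func, *args, **kwargs):
    """Run func under cProfile and print the 10 costliest entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()

    # Sort by cumulative time: time spent in a function *and* everything it calls.
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    print(stream.getvalue())
    return result
```

Returning the function's result lets you profile a call without changing the surrounding code's behaviour.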
#### Flamegraph Visualization (Demonstration)
- Run the notebook cell to generate a standalone script.
- Run from the terminal: `py-spy record -o flamegraph.svg -- python src/flamegraph_profile.py`
- Open the SVG in a browser and identify the widest bar (= the bottleneck).
#### Line-by-Line Profiling (Demonstration)
- Use the provided `line_profiler` cell.
- Identify the exact lines that consume the most time.
Goal: Learn that you can't optimize what you can't measure.
### Exercise 5: The 10x Challenge 🚀
#### TODO 7: `find_duplicates_fast()`
Your task is to refactor the slow function to be at least 10x faster.
- Strategy: Use a hash map (dictionary or `collections.Counter`) to count items in O(N).
- Validation: Compare your results against the slow version.
- Bonus: Can you make it 1000x faster?
Goal: Prove that Algorithms > Hardware. A better algorithm on a laptop beats a bad algorithm on a supercomputer.
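As a hedged sketch of the counting strategy (one of several valid O(N) designs, and intentionally not a full spoiler of the challenge):

```python
from collections import Counter

def find_duplicates_fast(items):
    """One O(N) counting pass replaces the O(N²) nested scan."""
    counts = Counter(items)                    # hash map: value -> occurrence count
    return [x for x, c in counts.items() if c > 1]
```

Validate it the way the exercise asks: run both versions on the same small sample and check the results agree (as sets, since the slow version may report duplicates in a different order).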
## 📊 Expected Results
When you complete the lab successfully:
| Exercise | Metric | Expected Value |
|---|---|---|
| TODO 2 | Set vs List speedup | ~1000x |
| TODO 3 | Python sort vs Bubble (N=5000) | ~100-2500x |
| TODO 4-5 | Set-based faster than in-place | Yes |
| TODO 7 | Fast vs Slow speedup | ≥10x (often 10,000x!) |
## 📦 What to Submit
Submit exactly these two files:
- `notebooks/lab02_complexity.ipynb` – your completed notebook.
- `results/lab02_metrics.json` – the JSON file generated by the notebook.
Do NOT submit:
- The 1M row CSV file (it's ~60-80MB).
- The `__pycache__` directories.
- Flamegraph SVG files.
## 🎓 Next Steps
After completing this lab:
- Check your `results/lab02_metrics.json`.
- Write your reflection in the notebook.
- Submit your work!
Questions? Check the Tips & Reference Guide or ask your instructor.