# Lab 01: Tips & Quick Reference

A complete guide with detailed tips, code examples, and a quick reference for all TODO functions.
## 📚 General Tips

Before you start:
- Read the docstring carefully - it tells you exactly what the function should do
- Look at the test cell below each function - it shows you how the function will be used
- Start simple - get something working, then refine it
- Use the Python documentation if you're stuck on a specific function
## 🔑 Essential Functions Cheat Sheet

### Path Operations

```python
from pathlib import Path

# Create a Path object (instantiate)
path = Path("data/raw")                   # Relative path
path = Path("/absolute/path/to/file")     # Absolute path
path = Path("data") / "raw" / "file.csv"  # Build path with / operator
path = Path.cwd()                         # Current working directory
path = Path.home()                        # User's home directory

# Check if a path exists
path.exists()

# Get file size in bytes
path.stat().st_size

# Create a directory (and any missing parents)
path.mkdir(parents=True, exist_ok=True)
```
### NumPy Random Data

```python
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Random integers from 1 to 10000 (upper bound is exclusive)
np.random.randint(1, 10001, size=n)

# Random floats from 0 to 100
np.random.uniform(0, 100, size=n)

# Random choice from a list
np.random.choice(["A", "B", "C"], size=n)

# Median of a list
np.median([1, 2, 3, 4, 5])  # Returns 3.0
```
### Pandas DataFrames

```python
import pandas as pd

# Create a date range (one timestamp per hour)
pd.date_range("2024-01-01", periods=100, freq="h")

# Create a DataFrame from a dict
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
})

# Get shape (rows, cols)
df.shape  # Returns a tuple: (n_rows, n_cols)

# Read / write CSV
df = pd.read_csv("file.csv")
df.to_csv("file.csv", index=False)

# Read / write Parquet
df = pd.read_parquet("file.parquet")
df.to_parquet("file.parquet", index=False)
```
### Timing Code

```python
import time

# High-precision timer
start = time.perf_counter()
# ... code to time ...
end = time.perf_counter()
elapsed = end - start  # In seconds
```
### JSON Operations

```python
import json

# Write JSON (pretty-printed)
with open("file.json", "w") as f:
    json.dump(my_dict, f, indent=2)

# Read JSON
with open("file.json") as f:
    data = json.load(f)
```
## TODO 1: `ensure_dir()`

### What you need to do

Create a directory (and any parent directories) if it doesn't already exist.

### Key concepts

- The `Path` object has a `.mkdir()` method
- You need to handle two cases:
  - The directory doesn't exist → create it
  - The directory already exists → don't raise an error

### Detailed hints

**Step 1:** Use the `mkdir()` method on the path object.

**Step 2:** Add the `parents` parameter.

- `parents=True` means "create parent directories if needed"
- Example: if you're creating `data/raw/`, it will also create `data/` if it doesn't exist

**Step 3:** Add the `exist_ok` parameter.

- `exist_ok=True` means "don't raise an error if the directory already exists"
- Without this, calling the function twice would cause an error

### Common mistakes

- ❌ Forgetting `parents=True` → fails if the parent directory doesn't exist
- ❌ Forgetting `exist_ok=True` → fails if the directory already exists
- ❌ Using `os.makedirs()` instead of `Path.mkdir()` → works, but not the modern way
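Putting the hints together, a minimal sketch might look like this (returning the `Path` is an assumption here; check your lab's docstring for the exact signature and return value):

```python
from pathlib import Path

def ensure_dir(path):
    """Create `path` (and any missing parents) without erroring if it exists."""
    p = Path(path)                        # Accept either a str or a Path
    p.mkdir(parents=True, exist_ok=True)  # Safe to call repeatedly
    return p

# Calling it twice is safe: the second call is a no-op
ensure_dir("data/raw")
ensure_dir("data/raw")
```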
## TODO 2: `write_synthetic_csv()`

### What you need to do

Generate fake data with 4 columns and save it as a CSV file.

### Key concepts

- Use `numpy` to generate random data
- Use `pandas` to organize the data into a DataFrame
- Save the DataFrame as CSV

### Detailed hints

**Step 1:** Set the random seed. This ensures the same random data is generated every time with the same seed.

**Step 2:** Generate the timestamp column.

- `periods=n_rows` → create exactly `n_rows` timestamps
- `freq="h"` → one timestamp per hour

**Step 3:** Generate the `user_id` column (random integers from 1 to 10000).

- Note: `randint(1, 10001)` gives you 1 to 10000 (the upper bound is exclusive)

**Step 4:** Generate the `value` column (random floats from 0 to 100).

**Step 5:** Generate the `category` column (random choice from A, B, C, D, E).

**Step 6:** Create a DataFrame:

```python
df = pd.DataFrame({
    "timestamp": timestamps,
    "user_id": user_ids,
    "value": values,
    "category": categories,
})
```

**Step 7:** Save to CSV.

- `index=False` → don't write row numbers as a column

**Step 8:** Get the file size.

**Step 9:** Return the metadata.

### Common mistakes

- ❌ Using `randint(1, 10000)` → gives you 1 to 9999 (the upper bound is exclusive!)
- ❌ Forgetting `index=False` → the CSV will have an extra column with row numbers
- ❌ Wrong column names → tests will fail
- ❌ Not setting the random seed → results won't be reproducible
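As a sketch, the nine steps above might combine like this. The return keys (`path`, `rows`, `size_bytes`) are assumptions for illustration; use whatever keys your lab's tests expect:

```python
import numpy as np
import pandas as pd
from pathlib import Path

def write_synthetic_csv(csv_path, n_rows=200_000, seed=0):
    np.random.seed(seed)  # Step 1: reproducibility
    df = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-01", periods=n_rows, freq="h"),
        "user_id": np.random.randint(1, 10001, size=n_rows),  # 1..10000 inclusive
        "value": np.random.uniform(0, 100, size=n_rows),
        "category": np.random.choice(list("ABCDE"), size=n_rows),
    })
    df.to_csv(csv_path, index=False)              # Step 7: no row-number column
    size = Path(csv_path).stat().st_size          # Step 8: size in bytes
    return {"path": str(csv_path), "rows": n_rows, "size_bytes": size}  # Step 9
```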
## TODO 3: `time_it()`

### What you need to do

Run a function multiple times and measure how long each run takes.

### Key concepts

- Use `time.perf_counter()` for high-precision timing
- Store all run times in a list
- Calculate the median (middle value)

### Detailed hints

**Step 1:** Create an empty list to store times.

**Step 2:** Loop `repeats` times.

**Step 3:** Inside the loop, measure the time:

```python
start = time.perf_counter()  # Record start time
fn()                         # Run the function
end = time.perf_counter()    # Record end time
elapsed = end - start        # Calculate elapsed time
times.append(elapsed)        # Add to list
```

**Step 4:** Calculate the median.

**Step 5:** Return the results.

### Why use the median instead of the mean?

- The median is less affected by outliers
- If one run is slow (e.g., due to background processes), it won't skew the result
- The median gives you the "typical" performance

### Common mistakes

- ❌ Using `time.time()` instead of `time.perf_counter()` → less precise
- ❌ Calculating the mean instead of the median → more affected by outliers
- ❌ Forgetting to call `fn()` → you're timing nothing!
- ❌ Timing the wrong thing (e.g., including the append operation)
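A minimal sketch of the loop, assuming the function returns a dict with the raw times and the median (your lab may expect different return keys or just the median):

```python
import time
import numpy as np

def time_it(fn, repeats=5):
    """Run `fn` `repeats` times; report each run's duration and the median."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()                            # Only the call itself is timed...
        end = time.perf_counter()
        times.append(end - start)       # ...the append happens after `end`
    return {"times": times, "median": float(np.median(times))}

result = time_it(lambda: sum(range(100_000)), repeats=3)
```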
## TODO 4: `read_csv_once()`

### What you need to do

Read a CSV file and return how many rows and columns it has.

### Key concepts

- Use `pd.read_csv()` to load the file
- Use `.shape` to get the dimensions

### Detailed hints

**Step 1:** Read the CSV.

**Step 2:** Get the shape.

- `df.shape` returns a tuple: `(n_rows, n_cols)`
- Example: `(200000, 4)` means 200,000 rows and 4 columns

### Common mistakes

- ❌ Returning `df.shape[0]` and `df.shape[1]` separately → return the tuple directly
- ❌ Returning `len(df)` → only gives the rows, not the columns
- ❌ Forgetting to return anything
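This one is short; a sketch (assuming the function takes the file path as its only argument):

```python
import pandas as pd

def read_csv_once(csv_path):
    """Read a CSV file and return its (n_rows, n_cols) shape tuple."""
    df = pd.read_csv(csv_path)
    return df.shape  # Return the whole tuple, not shape[0] and shape[1] separately
```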
## TODO 5: `write_parquet()`

### What you need to do

Read a CSV file and save it in Parquet format.

### Key concepts

- Parquet is a binary, columnar storage format
- It's more efficient than CSV (smaller files, faster reads)

### Detailed hints

**Step 1:** Read the CSV.

**Step 2:** Write to Parquet.

**Step 3:** Get the Parquet file size.

**Step 4:** Return the metadata.

### What is Parquet?

- **Columnar storage:** data is stored by column, not by row
- **Compressed:** uses efficient compression algorithms
- **Typed:** stores data type information (no need to parse strings)
- **Fast:** faster to read and write than CSV for large datasets

### Common mistakes

- ❌ Forgetting `index=False` → the Parquet file will include row numbers
- ❌ Wrong dictionary keys → the tests expect exact key names
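A sketch of the four steps. The signature and the return keys (`path`, `size_bytes`) are assumptions; match them to your lab's tests. Note that `to_parquet()` requires a Parquet engine (`pyarrow` or `fastparquet`) to be installed:

```python
import pandas as pd
from pathlib import Path

def write_parquet(csv_path, parquet_path):
    """Read a CSV and re-save it in Parquet format; return file metadata."""
    df = pd.read_csv(csv_path)                # Step 1: load the CSV
    df.to_parquet(parquet_path, index=False)  # Step 2: no row-number column
    size = Path(parquet_path).stat().st_size  # Step 3: size in bytes
    return {"path": str(parquet_path), "size_bytes": size}  # Step 4
```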
## TODO 6: `read_parquet_once()`

### What you need to do

Read a Parquet file and return its shape.

### Key concepts

- Same as `read_csv_once()`, but for Parquet files

### Detailed hints

**Step 1:** Read the Parquet file.

**Step 2:** Return the shape.
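A sketch, mirroring `read_csv_once()` (again assuming the path is the only argument):

```python
import pandas as pd

def read_parquet_once(parquet_path):
    """Read a Parquet file and return its (n_rows, n_cols) shape tuple."""
    return pd.read_parquet(parquet_path).shape
```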
## TODO 7: `save_json()`

### What you need to do

Save a Python dictionary as a JSON file.

### Key concepts

- Use the `json` module
- Use `indent=2` for pretty-printing (human-readable output)

### Detailed hints

**Step 1:** Open the file in write mode.

- The `with` statement ensures the file is properly closed

**Step 2:** Write the JSON.

- `obj` is the dictionary to save
- `f` is the file object
- `indent=2` makes it pretty (2-space indentation)

### What does `indent=2` do?

Without it, the entire dictionary is written on a single line; with it, each key appears on its own line with 2-space indentation.
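A quick illustration of the difference, using `json.dumps()` (the string-returning variant) so the output is easy to see:

```python
import json

obj = {"rows": 3, "cols": 2}

compact = json.dumps(obj)           # Without indent: everything on one line
pretty = json.dumps(obj, indent=2)  # With indent=2: one key per line

print(compact)  # {"rows": 3, "cols": 2}
print(pretty)
# {
#   "rows": 3,
#   "cols": 2
# }
```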
### Common mistakes

- ❌ Using `json.dumps()` instead of `json.dump()` → `dumps` returns a string, `dump` writes to a file
- ❌ Forgetting to open the file
- ❌ Not using a `with` statement → the file might not be properly closed
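Combining the steps, a minimal sketch (whether the function should return anything is an assumption; check your lab's docstring):

```python
import json

def save_json(obj, json_path):
    """Save a dictionary as pretty-printed JSON."""
    with open(json_path, "w") as f:  # `with` closes the file even on error
        json.dump(obj, f, indent=2)  # dump (not dumps) writes to the file
```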
## ⚠️ Common Pitfalls

| Mistake | Fix |
|---|---|
| `randint(1, 10000)` gives 1-9999 | Use `randint(1, 10001)` for 1-10000 |
| Forgot `index=False` in `to_csv()` | Always use `index=False` |
| Used `time.time()` instead of `perf_counter()` | Use `perf_counter()` for precision |
| Used `json.dumps()` instead of `dump()` | `dumps` → string, `dump` → file |
| Forgot `parents=True` in `mkdir()` | Add `parents=True, exist_ok=True` |
| Calculated mean instead of median | Use `np.median()` for timing |
## 🎯 Testing Strategy

After implementing each function:
- Run the test cell - Does it pass?
- Read the error message - What went wrong?
- Check the assertion - What was expected vs. what you got?
- Debug - Add print statements to see intermediate values
- Iterate - Fix and try again
### Example debugging

```python
def write_synthetic_csv(csv_path, n_rows=200_000, seed=0):
    np.random.seed(seed)
    # ... your code ...
    # Add debug prints:
    print(f"DataFrame shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"First row: {df.iloc[0]}")
    df.to_csv(csv_path, index=False)
    # ...
```
## 📊 Expected Results

When you complete the lab, you should see something like:

```text
✓ All imports successful!
✓ ensure_dir() works correctly!
✓ write_synthetic_csv() works correctly!
✓ time_it() works correctly!
✓ read_csv_once() works correctly!
✓ write_parquet() works correctly!
✓ read_parquet_once() works correctly!
✓ save_json() works correctly!

==================================================
RESULTS SUMMARY
==================================================
CSV file size:            15-20 MB
Parquet file size:        2-4 MB
Size ratio:               5-8x (CSV is larger)

CSV median read time:     0.1-0.3 sec
Parquet median read time: 0.02-0.08 sec
Speedup:                  2-5x (Parquet is faster)
==================================================

✓ Results saved to: ../results/lab01_metrics.json
```
Exact numbers depend on your system, but Parquet should always be:
- ✅ Smaller (5-10x)
- ✅ Faster to read (2-5x)
## 📦 Files to Submit

- `notebooks/lab01_setup_io.ipynb` (with all cells executed)
- `results/lab01_metrics.json` (generated by the notebook)
Do NOT submit:
- CSV or Parquet files
- The entire repository
- Screenshots
## 🆘 Getting Help

If you're stuck:
- Read the error message carefully - It often tells you exactly what's wrong
- Check the docstring - Does your function return the right type?
- Look at the test - What does it expect?
- Use print statements - See what your code is actually doing
- Ask your instructor or TA - That's what they're here for!
## 🚀 Next Steps

After completing all TODOs:

- Run all cells from top to bottom
- Check that `results/lab01_metrics.json` exists
- Read your JSON file - does it look correct?
- Write your reflection
- Submit the notebook and JSON file
Good luck! 🎉