Lab 01: Environment Setup and Basic I/O Benchmarking
Welcome to your first Big Data laboratory session! This lab will help you verify your development environment and introduce you to simple performance measurement.
📚 Additional Resources
- Tips & Reference Guide - Complete guide with detailed tips, code examples, and quick reference
🎯 What You Will Learn
- How to set up and verify that your Python environment is working correctly with `uv` and VS Code
- Basic file I/O operations: reading and writing CSV and Parquet formats
- Simple time measurement using Python's built-in `time` module
✅ Pre-flight Checklist
Before starting the lab, make sure your environment is ready. Open your terminal (Git Bash on Windows, Terminal on macOS/Linux) and run these commands:
Check Python version
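Run this in your terminal (on some systems the command is `python3` instead of `python`):

```shell
python --version
```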
Expected output: `Python 3.11.x` or `Python 3.12.x`
Check uv is installed
Expected output: uv 0.x.x (any recent version)
Check required libraries
Expected output: `All imports OK ✓`
If any of these commands fail, refer to the Environment Setup section below.
🛠️ Environment Setup from Scratch
If you're starting fresh, follow these steps to install all required tools.
1. Install Python
Windows:
- Download Python from python.org
- Run the installer and check "Add Python to PATH"
- Complete the installation
macOS: install with Homebrew (`brew install python`) or use the installer from python.org.
Linux (Ubuntu/Debian): install via apt (`sudo apt install python3 python3-venv`).
Verify installation:
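On macOS/Linux the binary is usually `python3`:

```shell
python3 --version
```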
2. Install VS Code
- Download VS Code from code.visualstudio.com
- Install the Python extension by Microsoft (search "Python" in the Extensions panel)
- Install the Jupyter extension by Microsoft (search "Jupyter")
3. Install Git
Windows:
- Download and install from git-scm.com
- This also installs Git Bash, which we'll use for terminal commands
macOS: install with Homebrew (`brew install git`) or via the Xcode Command Line Tools (`xcode-select --install`).
Linux: install via apt (`sudo apt install git`).
Verify:
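Confirm Git is available:

```shell
git --version
```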
4. Install uv Package Manager
uv is a modern, fast Python package manager we'll use throughout this course. Install it using pip:
Verify installation:
Expected output: uv 0.x.x (any recent version)
Note: `uv` is much faster than traditional `pip` for installing packages and managing virtual environments, which is why we use it in this course.
5. Clone the Repository
Navigate to where you want to store the course files:
```bash
# Example: Go to your Documents folder
cd ~/Documents

# Clone the repository (replace YOUR_USERNAME with your GitHub username)
git clone https://github.com/alvarodiez20/bigdata.git

# Enter the project directory
cd bigdata
```
6. Install Dependencies
Install all required Python packages using uv:
This creates a virtual environment (.venv) and installs pandas, pyarrow, jupyter, and other dependencies listed in pyproject.toml.
💻 Working with VS Code
Step 1: Open the Project Folder
- Open VS Code
- File → Open Folder...
- Select the `bigdata` folder you just cloned
- Trust the folder when prompted
Step 2: Select the Python Interpreter
- Press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (macOS)
- Type "Python: Select Interpreter"
- Choose the interpreter from `.venv` (it will show something like `.venv/bin/python`)
Step 3: Open the Notebook
- In the Explorer panel (left sidebar), navigate to `notebooks/`
- Click on `lab01_setup_io.ipynb`
- The notebook will open in VS Code
Step 4: Select the Kernel
- In the top-right corner of the notebook, click "Select Kernel"
- Choose "Python Environments..."
- Select the `.venv` interpreter (same as Step 2)
Step 5: Run Your First Cell
- Click on the first cell in the notebook
- Press `Shift+Enter` to run it
- You should see the output below the cell
✅ If you see output without errors, you're ready to go!
📝 Lab Steps
Follow along in the notebook notebooks/lab01_setup_io.ipynb. Here's what you'll do:
A. Create Required Folders
Make sure these folders exist:
- `data/raw/` — for raw CSV files
- `data/processed/` — for processed Parquet files
- `results/` — for benchmark results
The notebook will create them automatically using your ensure_dir() function.
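If you want to see the shape of such a helper, here is a minimal sketch of `ensure_dir()` using `pathlib` (your notebook's version may differ in signature):

```python
from pathlib import Path

def ensure_dir(path: str) -> Path:
    """Create the directory (and any parents) if it doesn't exist, then return it."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p

# Create the three lab folders
for folder in ("data/raw", "data/processed", "results"):
    ensure_dir(folder)
```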
B. Generate a Tiny Synthetic Dataset
You'll write a function write_synthetic_csv() that:
- Creates a simple DataFrame with 200,000 rows
- Columns: `timestamp`, `user_id`, `value`, `category`
- Saves it to `data/raw/synthetic.csv`
- Returns metadata (number of rows, columns, file size)
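A possible sketch (the column contents and the random seed here are illustrative; follow the notebook's spec for the exact schema):

```python
import numpy as np
import pandas as pd
from pathlib import Path

def write_synthetic_csv(path: str = "data/raw/synthetic.csv",
                        n_rows: int = 200_000) -> dict:
    """Generate a synthetic dataset and save it as CSV; return basic metadata."""
    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    df = pd.DataFrame({
        "timestamp": pd.Timestamp("2024-01-01")
                     + pd.to_timedelta(np.arange(n_rows), unit="s"),
        "user_id": rng.integers(1, 10_000, size=n_rows),
        "value": rng.normal(size=n_rows),
        "category": rng.choice(list("ABCD"), size=n_rows),
    })
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(p, index=False)
    return {"rows": len(df), "cols": df.shape[1], "size_bytes": p.stat().st_size}
```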
C. Time Reading the CSV (3 Repeats)
You'll implement time_it() to measure how long it takes to read the CSV file.
- Use `time.perf_counter()` for precise timing
- Repeat the read operation 3 times
- Calculate the median time
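The idea in code (a sketch; adapt it to the notebook's exact signature):

```python
import statistics
import time

def time_it(fn, repeats: int = 3) -> float:
    """Call fn() `repeats` times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

You would then call it like `time_it(lambda: pd.read_csv("data/raw/synthetic.csv"))`.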
D. Convert CSV to Parquet
You'll write write_parquet() to:
- Read the CSV file
- Save it as `data/processed/synthetic.parquet`
- Return metadata (file size, rows, columns)
E. Time Reading the Parquet (3 Repeats)
Same as step C, but for the Parquet file.
F. Save Results to JSON
You'll implement save_json() to save your benchmark results to results/lab01_metrics.json.
The JSON will include:
- CSV read times
- Parquet read times
- File sizes
- Speedup ratio (CSV time / Parquet time)
- Your reflection (3 lines of text)
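A sketch of `save_json()` (the field names inside `results` are up to your notebook):

```python
import json
from pathlib import Path

def save_json(results: dict, path: str = "results/lab01_metrics.json") -> Path:
    """Write the results dictionary to a pretty-printed JSON file."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(results, indent=2))
    return p
```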
G. (Optional) Out-of-Memory Test
⚠️ Warning: This is an optional advanced section that will intentionally try to crash your Python kernel!
You'll create a function that allocates increasingly large arrays until your system runs out of memory. This teaches you:
- What happens when data doesn't fit in RAM
- How to recognize OOM (Out of Memory) errors
- Why understanding memory limits is crucial in Big Data
What you'll see:
- On some systems: a clean `MemoryError` exception
- On others: the kernel will crash silently (the OS kills the process to protect the system)
This is completely safe - it only affects the notebook kernel, not your computer. Just restart the kernel after the test.
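To illustrate the mechanism without the crash, here is a sketch with a deliberate safety cap (`max_mb`); the notebook's version keeps allocating with no cap, which is what eventually triggers the OOM. The function name and parameters are illustrative, not the notebook's exact API:

```python
import numpy as np

def memory_stress(max_mb: int = 256, step_mb: int = 32) -> int:
    """Allocate ever-larger arrays until MemoryError (or the safety cap) is hit.

    Returns the number of megabytes successfully allocated.
    """
    blocks, allocated = [], 0
    try:
        while allocated < max_mb:
            # step_mb MB of float64 (8 bytes per element)
            blocks.append(np.ones(step_mb * 1024 * 1024 // 8))
            allocated += step_mb
    except MemoryError:
        print("MemoryError raised: out of memory!")
    finally:
        blocks.clear()  # release everything we grabbed
    return allocated
```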
🐛 Common Errors and Fixes
1. ModuleNotFoundError: No module named 'pandas'
Problem: Dependencies not installed.
Fix:
```bash
# Make sure you're in the project folder
cd ~/Documents/bigdata

# Sync dependencies
uv sync

# Always run Python commands with 'uv run'
uv run python -c "import pandas; print('OK')"
```
2. Kernel not found in VS Code
Problem: VS Code can't find the .venv interpreter.
Fix:
- Close and reopen VS Code
- Press `Ctrl+Shift+P` → "Python: Select Interpreter"
- If `.venv` doesn't appear, manually enter the path: `.venv/bin/python` (macOS/Linux) or `.venv\Scripts\python.exe` (Windows)
3. FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/...'
Problem: Folders don't exist yet.
Fix: The notebook should create them automatically with ensure_dir(). Make sure you run all cells in order from top to bottom.
4. Notebook cells take forever to run
Problem: The dataset might be too large or your system is slow.
Fix: In the notebook, reduce n_rows from 200,000 to 50,000 or 100,000.
5. uv: command not found
Problem: uv is not installed or not in your PATH.
Fix:
If you still have issues, close and reopen your terminal, then try again.
📦 What to Submit
Submit exactly these two files via your course platform (e.g., Moodle or email):
- `notebooks/lab01_setup_io.ipynb` — your completed notebook with all cells executed
- `results/lab01_metrics.json` — the JSON file generated by the notebook
Do NOT submit:
- The CSV or Parquet files
- The entire repository
- Screenshots (unless explicitly requested)
🎓 Reflections
At the end of the notebook, you'll be asked to write a short reflection (3 lines) answering:
- What surprised you about the performance difference?
- Why do you think Parquet is faster/smaller?
This reflection will be saved in your lab01_metrics.json file.
🚀 Next Steps
After completing this lab:
- Make sure both deliverable files are ready
- Check that your `results/lab01_metrics.json` contains all expected fields
- Review your reflection — does it make sense?
- Submit the files!
Questions? Ask your instructor. Happy coding! 🎉