Lab 07: Probabilistic Data Structures & Polars — Instructions
Welcome to Lab 07! This lab has two parts:
- Part A: Implement probabilistic data structures (HyperLogLog, t-digest) and explore hash function quality, validated with
pytest. - Part B: Learn Polars and use PySuricata to see streaming algorithms in action.
Additional Resources
- Tips & Reference Guide — theoretical explanations, formulas, and Polars overview.
Pre-flight Checklist
- Checkout the
mainbranch:git checkout main - Pull the latest changes from the repository:
git pull - Create a local branch for your work:
git checkout -b <your_branch_name> - Ensure you have the required dependencies updated. Run:
- Run the test suites (all tests will fail until you implement the functions):
Part A — Probabilistic Data Structures
You will edit src/lab07.py. Replace each NotImplementedError("TODO: ...") with your implementation.
Exercise 1: Hash Function Quality (TODOs 1–2)
Objective
Understand why hash function quality is critical for probabilistic data structures by comparing a deliberately bad hash against a proper one.
bad_hash(item, table_size)(TODO 1): Implement a deliberately poor hash function. Uselen(str(item)) % table_size. This produces very few distinct outputs.good_hash(item, table_size)(TODO 2): Implement a proper hash usinghashlib.sha256. Convert the item to bytes, hash it, convert the digest to an integer, and take modulotable_size.
Validation: Run uv run pytest tests/test_lab07.py -k "test_hash"
Exercise 2: HyperLogLog (TODOs 3–6)
Objective
Estimate the number of distinct elements in a stream using \(O(\log \log n)\) memory, implementing the HyperLogLog algorithm.
Implement the HyperLogLog class:
_hash(item)(TODO 3): Hash an item usinghashlib.sha256and return an integer. Convert the hex digest to an integer withint(digest, 16)._leading_zeros(hash_val, max_bits)(TODO 4): Count the number of leading zeros in the binary representation of the remaining bits (after removing the bucket index bits). Start frommax_bits - 1and count down while each bit is 0. Return at least 1.add(item)(TODO 5): Hash the item, extract the bucket index from the firstpbits, count leading zeros from the remaining bits, and update the register with the maximum value.-
estimate()(TODO 6): Compute the cardinality estimate using the harmonic mean formula and bias correction. Apply small range correction when appropriate (see guide for formulas).The formulas you need:
- Bias correction: \(\alpha_m = \frac{0.7213}{1 + 1.079/m}\)
- Raw estimate: \(E = \alpha_m \cdot m^2 \cdot \left(\sum_{j=0}^{m-1} 2^{-\text{registers}[j]}\right)^{-1}\)
- Small range: if \(E \le 2.5 \cdot m\) and any register is 0, use \(E^* = m \cdot \ln(m / V)\) where \(V\) = number of zero registers.
Validation: Run uv run pytest tests/test_lab07.py -k "test_hyperloglog"
Exercise 3: T-Digest (TODOs 7–9)
Objective
Approximate streaming quantiles (median, p99, etc.) without storing all values, using the t-digest algorithm.
Implement the TDigest class. The Centroid dataclass and _compress() method are provided for you.
add(value)(TODO 7): Create a newCentroid(mean=value, weight=1)and append it to the internal list. If the number of centroids exceedsmax_unmerged, call_compress().quantile(q)(TODO 8): Walk through the sorted centroids, accumulating weight. When the accumulated weight crossesq * total_weight, return the mean of the current centroid. Return the last centroid's mean ifqis very close to 1.0. Handle edge cases first: return 0.0 if no centroids, return the single centroid's mean if only one.merge(other)(TODO 9): Merge another TDigest into this one. Extend this digest's centroids with the other's centroids, then call_compress().
Understanding _compress()
The provided _compress() method sorts centroids by mean, then greedily merges adjacent centroids as long as their combined weight stays within the scale function limit: \(\text{max\_weight}(q) = 4 \cdot \frac{n}{\delta} \cdot q \cdot (1 - q)\). This keeps tail centroids small (high precision) and center centroids large (lower precision).
Validation: Run uv run pytest tests/test_lab07.py -k "test_tdigest"
Part B — Polars & PySuricata
You will edit src/lab07_polars.py. Exercises 4a–4d are validated with pytest using synthetic data (no download needed). Exercise 5 is run directly as a script.
Exercise 4: Introduction to Polars (TODOs 10–13)
Objective
Get hands-on experience with Polars as a modern alternative to pandas. Understand lazy vs eager evaluation using a real-world dataset with ~3 million rows.
The script automatically downloads the NYC Yellow Taxi Trip Records dataset (~45 MB parquet, ~3 million rows) on first run.
load_taxi_eager(path)(TODO 10): Usepl.read_parquet()to load the dataset eagerly. Notice the brief pause as all ~3M rows load.load_taxi_lazy(path)(TODO 11): Usepl.scan_parquet()to create a LazyFrame. Notice how it returns instantly — no data is loaded yet.filter_and_group(df)(TODO 12): Filter trips wheretrip_distance > 2.0, group byPULocationID, and compute the meanfare_amount. Usepl.col()expressions.add_computed_column(df)(TODO 13): Add a new column"tip_percentage"computed as(tip_amount / total_amount) * 100usingwith_columns. Handle division by zero with.fill_nan(0.0).fill_null(0.0).
Validation: Run uv run pytest tests/test_lab07_polars.py -v
Run the full script against the real dataset to see the results:
Exercise 5: PySuricata with Polars (TODOs 14–15)
Objective
Use PySuricata to generate a streaming data profile via Polars, and connect the report to the algorithms from Labs 06 and 07. With ~3 million rows, you will see PySuricata's streaming architecture in action.
generate_report(lf)(TODO 14): Callpysuricata.profile(lf)to profile the LazyFrame, then save the report as"taxi_report.html". Watch PySuricata process the data in chunks — this is exactly the streaming model we studied in class.- Reflection (TODO 15): In the
STUDENT REFLECTIONsection at the top of the file, answer:- Which streaming algorithms from Lab 06 can you identify in the PySuricata report?
- How does PySuricata handle datasets larger than memory?
- What advantage does using a
LazyFramegive PySuricata compared to an eagerDataFrame?
Run the script and open the generated report:
What to Submit
When you are finished and both test suites pass, you are done with Parts A and B:
Before submitting, make sure to write a short paragraph in the STUDENT REFLECTION section at the top of both files.
Submit exactly:
src/lab07.py— Your completed probabilistic data structures.src/lab07_polars.py— Your Polars and PySuricata exercises.taxi_report.html— The PySuricata report you generated.
Do NOT submit: Notebooks, the __pycache__ directories, or the downloaded parquet file.
Questions? Check the Tips & Reference Guide or ask your instructor.