PySuricata
Generate clean, self-contained exploratory data analysis (EDA) reports for pandas/polars DataFrames and for large datasets supplied as iterables of in-memory chunks.
Tip
Works great on small and medium datasets; for very large datasets, sample first.
Features
- Summary stats, missingness, duplicates
- Numeric, categorical, datetime, and boolean cards with inline SVG charts
- Chunk-by-chunk streaming over DataFrames supplied one chunk at a time (low peak memory)
- Approximate distinct counts and heavy hitters for large columns
- Streaming correlations for numeric columns
- Self-contained HTML export (inline CSS/JS/images)
Quick start
import pandas as pd
from pysuricata import profile, ReportConfig
df = pd.read_csv("/path/to/data.csv")
rep = profile(df, config=ReportConfig())
rep.save_html("report.html")
# Or stream in-memory chunks you create
def chunk_iter():
    for i in range(10):
        yield pd.read_csv(f"/data/part-{i}.csv")
rep = profile(chunk_iter(), config=ReportConfig())
rep.save_html("report.html")
How it works
- Reads input in chunks (pandas DataFrames) and feeds type-specific accumulators.
- Numeric accumulators maintain Welford/Pébay moments, a reservoir sample, and a KMV (k-minimum-values) sketch for distinct counts.
- Categorical accumulators use Misra–Gries for top-k values and KMV for distinct counts.
- Datetime accumulators count by hour/day/month and keep min/max.
- A lightweight streaming correlation estimator tracks Pearson r for numeric pairs.
- The template renders a self-contained HTML report that includes the run duration (e.g., 0.02s) and the approximate number of bytes processed.
- Deterministic visuals via ReportConfig.compute.random_seed.
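To make the categorical accumulators in the list above more concrete, here is a minimal sketch of a Misra–Gries top-k counter and a KMV distinct-count sketch fed chunk by chunk. The class names MisraGries and KMV, the blake2b hash, and the default sizes are assumptions made for illustration; they are not PySuricata's internal API.

```python
import hashlib
import heapq
from collections import Counter


class MisraGries:
    """Approximate heavy hitters using at most k - 1 counters (hypothetical helper)."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def update(self, item):
        if item in self.counts or len(self.counts) < self.k - 1:
            self.counts[item] += 1
        else:
            # No free slot: decrement every candidate and drop those that hit zero.
            for key in list(self.counts):
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]


class KMV:
    """K-minimum-values sketch for approximate distinct counts (hypothetical helper)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap of the k smallest hashes, stored negated
        self._members = set()  # hashes currently held in the heap

    @staticmethod
    def _hash(item):
        # Assumption: a 64-bit blake2b digest of repr(item) is uniform enough here.
        digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))      # exact while under capacity
        kth_smallest = -self._heap[0] / 2**64  # normalize to the unit interval
        return (self.k - 1) / kth_smallest


# Feed both sketches chunk by chunk, as a streaming profiler would.
chunks = [["a", "b", "a"], ["c", "a", "b"], ["a", "d", "a"]]
top = MisraGries(k=3)
distinct = KMV(k=64)
for chunk in chunks:
    for value in chunk:
        top.update(value)
        distinct.update(value)

print(top.counts.most_common(1), distinct.estimate())
```

Misra–Gries counts are lower bounds: with k - 1 counters over n items, each reported count undershoots the true count by at most n / k. The KMV estimate is exact until more than k distinct hashes have been seen.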
See also: Numeric analysis details in numeric_var.md.
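As a quick illustration of the numeric side, the sketch below keeps a count, mean, and sum of squared deviations per chunk and merges the partial summaries, in the spirit of the Welford/Pébay moments mentioned under How it works. The Moments class and the summarize helper are hypothetical names used only for this example, and only the first two moments are tracked.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Welford's single-value update."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Pairwise merge of two partial summaries into a new one."""
        n = self.n + other.n
        if n == 0:
            return Moments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN with fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarize(values):
    """Build a Moments summary for one chunk of numbers."""
    acc = Moments()
    for v in values:
        acc.update(v)
    return acc


# Summarize each chunk on its own, then merge the partial results.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
total = Moments()
for chunk in chunks:
    total = total.merge(summarize(chunk))

print(total.n, total.mean, total.variance)
```

Because merge only needs the three fields of each summary, chunks can be processed one at a time and discarded, which is what keeps peak memory low.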