Architecture & Internals
This document explains how pysuricata profiles data efficiently and renders a self‑contained HTML report.
Overview
┌────────┐   ┌────────────────┐   ┌────────────────────┐   ┌───────────────┐
│ Source │ → │ Chunk iterator │ → │ Typed accumulators │ → │ HTML template │
└────────┘   └────────────────┘   └────────────────────┘   └───────────────┘
In-memory DataFrame(s)            numeric / categorical / datetime / boolean
Chunk ingestion
- Iterable of pandas DataFrames: consumed as-is.
- Single pandas DataFrame: treated as one chunk. To process it in smaller pieces, slice it by rows yourself and pass the slices as an iterable.
Typed accumulators
Each column kind is handled by a specialized accumulator with small, mergeable state:
- NumericAccumulator
  - Moments (n, mean, M2, M3, M4) via Welford/Pébay (exact, mergeable)
  - Min/Max, zeros, negatives, ±inf, missing counters
  - Reservoir sample (default 20k) for quantiles, histograms, and shape hints
  - KMV (K‑Minimum Values) for approximate distinct
  - Misra–Gries top‑k for discrete integer‑like columns (on demand)
  - Heaping %, granularity (decimals/step), bimodality hint
  - Streaming correlation chips (optional, numeric vs numeric)
  - Extremes with row indices (min / max tracked across chunks)
- CategoricalAccumulator
  - KMV for distinct, Misra–Gries for top‑k
  - String length stats (avg, p90), empty strings
  - Case/trim variant distinctness
- DatetimeAccumulator
  - Min/Max timestamps (ns), counts by hour / day of week / month
  - Monotonicity hints
- BooleanAccumulator
  - True/False counts, missing, imbalance hints
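The two sketches named in the list above, KMV for distinct counts and Misra–Gries for top‑k, can be illustrated in a few lines. This is a minimal sketch of the general techniques, not pysuricata's implementation: the class names, the `blake2b`-based hash, and the method names are assumptions made for illustration.

```python
import hashlib
import heapq


def _hash01(value) -> float:
    """Map a value to a stable pseudo-uniform float in [0, 1]."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class KMV:
    """K-Minimum Values: keep the k smallest distinct hashes seen so far."""

    def __init__(self, k: int = 2048):
        self.k = k
        self._heap = []        # max-heap of kept hashes (stored negated)
        self._members = set()  # hashes currently kept, for de-duplication

    def add(self, value) -> None:
        h = _hash01(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest kept hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact, barring hash collisions
        return (self.k - 1) / -self._heap[0]  # (k - 1) / k-th smallest hash


class MisraGries:
    """Misra-Gries heavy hitters with at most k counters."""

    def __init__(self, k: int = 50):
        self.k = k
        self.counters = {}

    def add(self, item) -> None:
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def top(self, n: int):
        """Return up to n (item, count) pairs; counts are lower bounds."""
        return sorted(self.counters.items(), key=lambda kv: -kv[1])[:n]
```

Both structures use bounded memory regardless of stream length, which is why their results are reported as approximate: KMV is exact only while fewer than `k` distinct values have been seen, and Misra–Gries counts can undercount the true frequencies.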
All accumulators expose update(...) and finalize() → SummaryDataclass for rendering.
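To make the update/finalize contract concrete, here is a simplified, hypothetical accumulator. It tracks only n, mean, and M2 (Welford's update plus the standard pairwise merge) and omits the M3/M4 terms, sketches, and sampling that the real NumericAccumulator carries. The names `MomentsAccumulator`, `NumericSummary`, and `merge` are assumptions for illustration, not pysuricata's API.

```python
import math
from dataclasses import dataclass


@dataclass
class NumericSummary:
    count: int
    missing: int
    inf: int
    mean: float
    variance: float  # sample variance (n - 1 denominator)
    minimum: float
    maximum: float


class MomentsAccumulator:
    """Hypothetical stand-in: small, mergeable state for one numeric column."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.missing = 0
        self.inf = 0
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, values) -> None:
        for x in values:
            if x is None or math.isnan(x):
                self.missing += 1  # counted, but excluded from moments
                continue
            if math.isinf(x):
                self.inf += 1      # counted, but excluded from moments
                continue
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)

    def merge(self, other: "MomentsAccumulator") -> None:
        """Fold another accumulator's state into this one."""
        self.missing += other.missing
        self.inf += other.inf
        n = self.n + other.n
        if n == 0:
            return
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    def finalize(self) -> NumericSummary:
        variance = self.m2 / (self.n - 1) if self.n > 1 else math.nan
        return NumericSummary(self.n, self.missing, self.inf, self.mean,
                              variance, self.minimum, self.maximum)
```

Because the state is a handful of scalars, two accumulators built from different chunks can be merged and give the same mean and variance as a single pass over all rows, up to floating-point rounding.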
Streaming correlations
_StreamingCorr maintains pairwise sufficient statistics for numeric columns and, for each column, emits the top absolute correlations above a configurable threshold.
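One way to picture "pairwise sufficient statistics" is to keep six running sums per column pair and derive Pearson's r from them on demand. The sketch below takes that approach. It is not the `_StreamingCorr` source: the class names, the dict-of-lists chunk format, and the default `threshold` and `max_per_col` values are placeholders chosen for illustration.

```python
import math
from itertools import combinations


class PairStats:
    """Running sums that are sufficient to compute Pearson's r for one pair."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, xs, ys) -> None:
        for x, y in zip(xs, ys):
            if not (math.isfinite(x) and math.isfinite(y)):
                continue  # skip rows where either value is NaN or ±inf
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

    def corr(self) -> float:
        if self.n < 2:
            return math.nan
        cov = self.sxy - self.sx * self.sy / self.n
        var_x = self.sxx - self.sx * self.sx / self.n
        var_y = self.syy - self.sy * self.sy / self.n
        if var_x <= 0 or var_y <= 0:
            return math.nan  # constant column: correlation undefined
        return cov / math.sqrt(var_x * var_y)


class StreamingCorrSketch:
    """Hypothetical illustration; threshold and max_per_col are placeholders."""

    def __init__(self, columns, threshold=0.6, max_per_col=2):
        self.columns = list(columns)
        self.threshold = threshold
        self.max_per_col = max_per_col
        self.pairs = {pair: PairStats() for pair in combinations(self.columns, 2)}

    def update(self, chunk) -> None:
        """chunk maps column name -> list of numbers, all of equal length."""
        for (a, b), stats in self.pairs.items():
            stats.update(chunk[a], chunk[b])

    def top(self) -> dict:
        """For each column, the strongest correlations above the threshold."""
        out = {c: [] for c in self.columns}
        for (a, b), stats in self.pairs.items():
            r = stats.corr()
            if not math.isnan(r) and abs(r) >= self.threshold:
                out[a].append((b, r))
                out[b].append((a, r))
        for col in out:
            out[col] = sorted(out[col], key=lambda item: -abs(item[1]))[: self.max_per_col]
        return out
```

Raw sums keep the example short but can lose precision on large-magnitude data; a co-moment update in the style of Welford is a more numerically stable alternative. The number of pairs grows quadratically with the number of columns, which is presumably why a column cap such as corr_max_cols exists.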
Rendering pipeline
- Infer column kinds from the first chunk.
- Build accumulators and consume the first chunk.
- Consume the remaining chunks, updating streaming correlations if enabled.
- Compute summary metrics (missingness, duplicates, constant columns, etc.).
- Render the template with:
  - Summary cards (rows, cols, processed bytes (≈), missing/duplicates)
  - Top missing columns
  - Variables (cards by type)
  - Optional dataset sample
The template is a single file with inline CSS/JS/images, so the output is a portable, self‑contained HTML file.
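The control flow of the pipeline, up to the point where summaries are handed to the template, can be sketched as follows. Everything here is a simplified assumption: chunks are plain dicts of lists rather than DataFrames, kind inference skips datetimes, and `CountingAccumulator` stands in for the typed accumulators.

```python
from itertools import chain


def infer_kind(values) -> str:
    """Guess a column kind from the non-missing values of the first chunk."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "boolean"  # checked first because bool is a subclass of int
    if present and all(isinstance(v, (int, float)) for v in present):
        return "numeric"
    return "categorical"


class CountingAccumulator:
    """Placeholder accumulator: counts present and missing values only."""

    def __init__(self, kind: str):
        self.kind = kind
        self.count = 0
        self.missing = 0

    def update(self, values) -> None:
        for v in values:
            if v is None:
                self.missing += 1
            else:
                self.count += 1

    def finalize(self) -> dict:
        return {"kind": self.kind, "count": self.count, "missing": self.missing}


def profile(chunks):
    """Return (row_count, per-column summaries) for {column: list} chunks."""
    chunks = iter(chunks)
    first = next(chunks, None)
    if first is None:
        return 0, {}
    # Infer kinds from the first chunk and build one accumulator per column.
    accs = {name: CountingAccumulator(infer_kind(col)) for name, col in first.items()}
    rows = 0
    # Consume the first chunk, then the remaining ones, in a single loop.
    for chunk in chain([first], chunks):
        for name, acc in accs.items():
            acc.update(chunk[name])
        rows += len(next(iter(chunk.values()), []))
    # Finalize into summaries that a template could render.
    return rows, {name: acc.finalize() for name, acc in accs.items()}
```

Pulling the first chunk off the iterator before the loop is what allows kinds to be fixed up front without reading the source twice.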
Shared helpers (deduped)
Rendering utilities live in two small modules for reuse and testability:
- pysuricata/render/svg_utils.py: safe_col_id, nice_ticks, fmt_tick, svg_empty
- pysuricata/render/format_utils.py: human_bytes, fmt_num, fmt_compact
These power both the main report and the individual variable cards with consistent tick/label formatting.
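For a sense of what helpers like these typically do, here are two stand-alone sketches. They are deliberately given different names (`human_bytes_sketch`, `nice_ticks_sketch`) because the real signatures, unit conventions, and rounding rules in pysuricata are not documented here and may differ.

```python
import math


def human_bytes_sketch(n: float) -> str:
    """Format a byte count with binary units (assumed output format)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if abs(value) < 1024 or unit == units[-1]:
            return f"{value:.0f} {unit}" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024


def nice_ticks_sketch(lo: float, hi: float, target: int = 5) -> list:
    """Pick tick positions on a 1/2/5 x 10^k grid covering [lo, hi]."""
    if not (math.isfinite(lo) and math.isfinite(hi)) or target < 1:
        return []
    if lo == hi:
        return [lo]
    if lo > hi:
        lo, hi = hi, lo
    raw_step = (hi - lo) / target
    magnitude = 10 ** math.floor(math.log10(raw_step))
    residual = raw_step / magnitude
    # Round the step up to the nearest "nice" multiplier.
    if residual <= 1:
        nice = 1
    elif residual <= 2:
        nice = 2
    elif residual <= 5:
        nice = 5
    else:
        nice = 10
    step = nice * magnitude
    first = math.ceil(lo / step)
    last = math.floor(hi / step)
    return [i * step for i in range(first, last + 1)]
```

Computing ticks as integer multiples of the step, rather than by repeated addition, avoids accumulating floating-point drift along the axis.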
Configuration
ReportConfig controls chunk size, sample sizes, distinct/top‑k sketch sizes, and correlation settings, plus logging and checkpointing. It also exposes random_seed to make sampling deterministic for reproducible visuals.
Key fields:
- chunk_size: rows per chunk (default 200k)
- numeric_sample_k: reservoir size for numeric sampling (default 20k)
- uniques_k: KMV sketch size (default 2048)
- topk_k: Misra–Gries capacity (default 50)
- compute_correlations: enable/disable streaming correlation chips
- corr_threshold, corr_max_cols, corr_max_per_col
- include_sample, sample_rows
- Checkpointing: write periodic pickles and (optional) partial HTML
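To see why a seed makes sampling reproducible, consider a textbook reservoir sampler (Algorithm R) driven by its own seeded random generator. This is an illustrative sketch, not the library's sampler: the `Reservoir` class and its `seed` argument are assumed names that play the roles of numeric_sample_k and random_seed.

```python
import random


class Reservoir:
    """Uniform sample of up to k items from a stream (Algorithm R)."""

    def __init__(self, k: int = 20_000, seed=None):
        self.k = k
        self.seen = 0
        self.sample = []
        self._rng = random.Random(seed)  # private RNG: no global state touched

    def update(self, values) -> None:
        for v in values:
            self.seen += 1
            if len(self.sample) < self.k:
                self.sample.append(v)  # fill phase: keep everything
            else:
                # Keep the new item with probability k / seen.
                j = self._rng.randrange(self.seen)
                if j < self.k:
                    self.sample[j] = v


def quantile(sample, q: float):
    """Nearest-rank quantile of a non-empty sample, for q in [0, 1]."""
    ordered = sorted(sample)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]
```

With the same seed and the same row order, the sample is identical no matter how the rows are split into chunks, so quantiles and histograms derived from it do not change between runs.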
Processed bytes & timing
The report shows:
- Processed bytes (≈): cumulative bytes processed across chunks (not process RSS)
- Precise generation time in seconds (e.g., 0.02s)
Security & correctness notes
- HTML escaping: column names, labels, and chip text are escaped before rendering.
- Missing/inf handling: NaN and ±Inf are excluded from moment calculations but reported separately.
- Approximation badges: estimates are marked with (≈) or an approx badge.
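As an illustration of the escaping note above, the standard library's `html.escape` is enough to neutralize markup in user-controlled strings such as column names. The `render_chip` function, its CSS class names, and the markup layout are hypothetical and only show where the escaping would sit.

```python
import html


def render_chip(label: str, text: str, approximate: bool = False) -> str:
    """Build a small HTML chip; label and text are escaped before insertion."""
    badge = ' <span class="badge">approx</span>' if approximate else ""
    return (
        f'<span class="chip" title="{html.escape(label)}">'
        f"{html.escape(text)}{badge}</span>"
    )
```

`html.escape` also escapes quotes by default, which matters when the value lands inside an attribute such as `title`. A column named `<b>"x"</b>` therefore appears in the report as literal text instead of being interpreted as markup.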
Extending
- Add backends: polars/Arrow datasets or DuckDB scans can be plugged into the chunk iterator.
- Add quantile sketches: t‑digest or KLL can replace the default reservoir for better tail accuracy.
- Add new sections: drift comparisons, profile JSON export to file, CLI wrapper.