Architecture & Internals
How pysuricata profiles data efficiently and renders a self-contained HTML report.
High-Level Pipeline
flowchart LR
A["Data Source"] --> B["Chunk Iterator"]
B --> C["Typed Accumulators"]
C --> D["Summary Metrics"]
D --> E["HTML Renderer"]
style A fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20
style B fill:#C8E6C9,stroke:#2E7D32,color:#1B5E20
style C fill:#A5D6A7,stroke:#2E7D32,color:#1B5E20
style D fill:#81C784,stroke:#2E7D32,color:#1B5E20
style E fill:#66BB6A,stroke:#2E7D32,color:#fff
Data Sources → pandas DataFrames, polars DataFrames, or any iterable of DataFrames (for streaming).
Chunk Iterator → If a single DataFrame is passed, it is treated as one chunk. Generators are consumed chunk-by-chunk to bound memory.
Typed Accumulators → Each column is assigned a specialized accumulator based on its inferred type. All accumulators are streaming: they accept one chunk at a time and maintain bounded state.
Summary Metrics → After all chunks are consumed, accumulators are finalized and dataset-wide metrics (missingness, duplicates, etc.) are computed.
HTML Renderer → A single-file Jinja2 template with inline CSS/JS produces a portable, self-contained HTML report.
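The chunk-iterator step above can be sketched as a small normalizing generator. This is a minimal illustration and not pysuricata's actual code: `Frame`, `iter_chunks`, and `count_rows` are hypothetical names, and `Frame` is a stand-in for a pandas or polars DataFrame so that the example needs only the standard library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Union


@dataclass
class Frame:
    """Hypothetical stand-in for a pandas/polars DataFrame."""
    rows: List[dict]


def iter_chunks(source: Union[Frame, Iterable[Frame]]) -> Iterator[Frame]:
    """Yield chunks from either a single frame or an iterable of frames."""
    if isinstance(source, Frame):
        # A single frame is treated as exactly one chunk.
        yield source
        return
    # Generators are consumed lazily, so only one chunk is alive at a time.
    for chunk in source:
        yield chunk


def count_rows(source: Union[Frame, Iterable[Frame]]) -> int:
    """Example consumer: total rows across all chunks."""
    return sum(len(chunk.rows) for chunk in iter_chunks(source))
```

Because downstream code only ever sees an iterator of chunks, the same accumulator loop works for in-memory data and for streams.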
Accumulator Architecture
classDiagram
class BaseAccumulator {
+name: str
+count: int
+missing: int
+update(chunk)
+finalize() Summary
}
class NumericAccumulator {
+StreamingMoments
+ReservoirSampler
+KMV sketch
+MisraGries top-k
+ExtremeTracker
}
class CategoricalAccumulator {
+KMV sketch × 3
+MisraGries top-k
+String length stats
}
class DatetimeAccumulator {
+min/max timestamps
+hour/weekday/month counts
+monotonicity tracker
}
class BooleanAccumulator {
+true_count
+false_count
}
BaseAccumulator <|-- NumericAccumulator
BaseAccumulator <|-- CategoricalAccumulator
BaseAccumulator <|-- DatetimeAccumulator
BaseAccumulator <|-- BooleanAccumulator
Each accumulator follows the same interface:
- update(chunk): process a batch of values and update internal state
- finalize(): compute final statistics from the accumulated state
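As a rough sketch of this interface, the simplest case (a boolean column) might look like the following. The attribute names come from the class diagram; everything else is an assumption made for illustration, including the dictionary returned by `finalize()`, the use of `None` for missing values, and the choice that `count` excludes missing values.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Optional


class BaseAccumulator(ABC):
    """Shared state and interface; a sketch, not the library's class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.count = 0    # non-missing values seen (assumed meaning)
        self.missing = 0  # missing values seen

    @abstractmethod
    def update(self, chunk: Iterable[Optional[bool]]) -> None:
        """Process one batch of values."""

    @abstractmethod
    def finalize(self) -> Dict[str, object]:
        """Compute final statistics from the accumulated state."""


class BooleanAccumulator(BaseAccumulator):
    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.true_count = 0
        self.false_count = 0

    def update(self, chunk: Iterable[Optional[bool]]) -> None:
        for value in chunk:
            if value is None:
                self.missing += 1
                continue
            self.count += 1
            if value:
                self.true_count += 1
            else:
                self.false_count += 1

    def finalize(self) -> Dict[str, object]:
        return {
            "name": self.name,
            "count": self.count,
            "missing": self.missing,
            "true": self.true_count,
            "false": self.false_count,
        }
```

State stays constant in size no matter how many chunks are fed in, which is the property every accumulator is expected to preserve.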
Data Flow
sequenceDiagram
participant User
participant profile as profile()
participant Infer as Type Inference
participant Acc as Accumulators
participant Corr as Correlations
participant Render as HTML Renderer
User->>profile: DataFrame / generator
profile->>Infer: First chunk
Infer-->>profile: Column types
profile->>Acc: Create typed accumulators
loop Each chunk
profile->>Acc: update(chunk)
profile->>Corr: update pairs (optional)
end
profile->>Acc: finalize()
Acc-->>profile: Per-column summaries
profile->>Corr: finalize()
Corr-->>profile: Correlation matrix
profile->>Render: Summaries + config
Render-->>User: Report (HTML)
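The sequence above can be condensed into a short driver loop. The sketch below is illustrative only: `profile_sketch`, `infer_kind`, and `CountingAccumulator` are hypothetical names, a chunk is modeled as a plain dict of column name to list of values, and the optional correlation and rendering steps are left out. It also assumes every chunk has the same columns as the first.

```python
from typing import Dict, Iterable, List

# Column name -> values; a stand-in for a DataFrame chunk.
Chunk = Dict[str, List[object]]


def infer_kind(values: List[object]) -> str:
    """Guess a column kind from the non-missing values of the first chunk."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "boolean"
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numeric"
    return "categorical"


class CountingAccumulator:
    """Minimal accumulator that tracks only count and missing."""

    def __init__(self, name: str, kind: str) -> None:
        self.name, self.kind = name, kind
        self.count, self.missing = 0, 0

    def update(self, values: List[object]) -> None:
        for v in values:
            if v is None:
                self.missing += 1
            else:
                self.count += 1

    def finalize(self) -> dict:
        return {"kind": self.kind, "count": self.count, "missing": self.missing}


def profile_sketch(chunks: Iterable[Chunk]) -> Dict[str, dict]:
    accumulators: Dict[str, CountingAccumulator] = {}
    for chunk in chunks:
        if not accumulators:
            # First chunk: infer types and create one accumulator per column.
            accumulators = {
                name: CountingAccumulator(name, infer_kind(values))
                for name, values in chunk.items()
            }
        for name, values in chunk.items():
            accumulators[name].update(values)
    # All chunks consumed: finalize into per-column summaries.
    return {name: acc.finalize() for name, acc in accumulators.items()}
```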
Streaming Algorithms
Each accumulator uses algorithms chosen for O(1) per-value update and bounded memory:
flowchart TB
subgraph Numeric["Numeric Accumulator"]
N1["Welford/Pébay<br/>mean, var, skew, kurt<br/>O(1) space"]
N2["Reservoir Sampling<br/>quantiles, histograms<br/>O(s) space"]
N3["KMV Sketch<br/>distinct count<br/>O(k) space"]
N4["Misra-Gries<br/>top-k values<br/>O(k) space"]
N5["Extreme Tracker<br/>min/max with indices<br/>O(k) space"]
end
subgraph Categorical["Categorical Accumulator"]
C1["KMV × 3<br/>distinct: original, lower, trimmed"]
C2["Misra-Gries<br/>top-k values"]
C3["String Length<br/>avg, p90"]
end
subgraph DateTime["DateTime Accumulator"]
D1["Min/Max<br/>timestamps"]
D2["Counters<br/>hour/weekday/month"]
D3["Monotonicity<br/>pair comparison"]
end
subgraph Boolean["Boolean Accumulator"]
B1["Counters<br/>true/false/missing"]
end
style Numeric fill:#E8F5E9,stroke:#2E7D32
style Categorical fill:#FFF3E0,stroke:#E65100
style DateTime fill:#E3F2FD,stroke:#1565C0
style Boolean fill:#F3E5F5,stroke:#6A1B9A
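For readers unfamiliar with the algorithms named in the diagram, here are textbook versions of four of them. These are minimal sketches under stated assumptions and not pysuricata's implementations: the class names echo the diagram, but the method names, the blake2b hash, and the restriction of the moments class to mean and variance (skewness and kurtosis omitted) are choices made for this example.

```python
import hashlib
import heapq
import random


class StreamingMoments:
    """Welford's online mean and variance (higher moments omitted)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # second factor uses the new mean

    def variance(self):
        """Sample variance (n - 1 denominator)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


class ReservoirSampler:
    """Algorithm R: a uniform sample of at most `size` values."""

    def __init__(self, size, seed=None):
        self.size, self.seen, self.sample = size, 0, []
        self._rng = random.Random(seed)

    def update(self, x):
        self.seen += 1
        if len(self.sample) < self.size:
            self.sample.append(x)
            return
        j = self._rng.randrange(self.seen)  # uniform in [0, seen)
        if j < self.size:
            self.sample[j] = x


class MisraGries:
    """Heavy-hitter candidates using at most `k` counters."""

    def __init__(self, k):
        self.k, self.counters = k, {}

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # Table full: decrement every counter, drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]


class KMVSketch:
    """K-minimum-values distinct-count estimator over 64-bit hashes."""

    def __init__(self, k):
        self.k = k
        self._heap = []     # negated hashes, so -_heap[0] is the largest kept
        self._kept = set()  # hashes currently in the heap

    def update(self, item):
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h in self._kept:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heapreplace(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        return (self.k - 1) * 2.0**64 / -self._heap[0]
```

In these textbook forms the Misra-Gries decrement pass touches up to k counters and the KMV heap update costs O(log k), so the constant per-value cost holds in the amortized sense for the former and up to a logarithmic factor in k for the latter.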
Rendering Pipeline
flowchart TB
A["Finalized Summaries"] --> B["Dataset-Level Metrics"]
B --> C["Jinja2 Template"]
C --> D["Inline CSS + JS"]
C --> E["Summary Cards"]
C --> F["Variable Cards"]
C --> G["Sample Table"]
D --> H["Single HTML File"]
E --> H
F --> H
G --> H
style H fill:#66BB6A,stroke:#2E7D32,color:#fff
The template produces a single portable HTML file — no external dependencies, no server required.
Summary cards show: rows, columns, processed bytes, missing %, duplicates %.
Variable cards are rendered per-type with SVG charts, statistics, and quality flags.
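The idea of a single file with inline styles can be illustrated with a few lines of standard-library code. Note the substitution: this sketch uses `string.Template` instead of Jinja2 so that it runs without third-party packages, and `render_summary`, its parameters, and the markup are all hypothetical. The processed-bytes card is omitted, and only numbers are interpolated, so no escaping is needed here.

```python
from string import Template

# The whole page, CSS included, lives in one string: no external assets.
PAGE = Template("""<!DOCTYPE html>
<html><head><meta charset="utf-8">
<style>.card{display:inline-block;padding:8px;border:1px solid #ccc}</style>
</head><body>$cards</body></html>""")

CARD = Template('<div class="card"><b>$label</b><br>$value</div>')


def render_summary(rows: int, columns: int, missing_cells: int,
                   duplicate_rows: int) -> str:
    """Render dataset-level metrics as summary cards in a standalone page."""
    total_cells = rows * columns
    missing_pct = 100.0 * missing_cells / total_cells if total_cells else 0.0
    duplicate_pct = 100.0 * duplicate_rows / rows if rows else 0.0
    cards = [
        ("Rows", f"{rows:,}"),
        ("Columns", str(columns)),
        ("Missing", f"{missing_pct:.1f}%"),
        ("Duplicates", f"{duplicate_pct:.1f}%"),
    ]
    body = "".join(CARD.substitute(label=l, value=v) for l, v in cards)
    return PAGE.substitute(cards=body)
```

Writing the returned string to a `.html` file is enough to get a report that opens in any browser offline.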
Shared Utilities
| Module | Functions | Purpose |
|---|---|---|
| render/svg_utils.py | safe_col_id, nice_ticks, fmt_tick, svg_empty | SVG chart helpers |
| render/format_utils.py | human_bytes, fmt_num, fmt_compact | Number formatting |
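To give a feel for what helpers of this kind typically do, here is a guess at two of them. The real signatures and output formats of `human_bytes` and `safe_col_id` are not documented on this page, so the functions below carry a `_sketch` suffix and every detail (binary units, one decimal place, the `col-` prefix) is an assumption.

```python
import re


def human_bytes_sketch(n: int) -> str:
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    size = float(n)
    for unit in units[:-1]:
        if size < 1024:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} {units[-1]}"


def safe_col_id_sketch(name: str) -> str:
    """Turn an arbitrary column name into a token usable as an HTML id."""
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", name).strip("-")
    return f"col-{slug or 'unnamed'}"
```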
Configuration
ReportConfig controls all behavior:
| Parameter | Default | Effect |
|---|---|---|
| chunk_size | 200,000 | Rows per chunk |
| numeric_sample_size | 20,000 | Reservoir size for quantiles |
| uniques_sketch_size | 2,048 | KMV sketch size |
| top_k_size | 50 | Misra-Gries capacity |
| compute_correlations | True | Enable/disable correlation chips |
| corr_threshold | 0.5 | Minimum \|r\| to display |
| random_seed | None | Deterministic sampling |
| include_sample | True | Include data sample in report |
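The table can be read as a set of typed defaults. The dataclass below mirrors it for illustration; it is named `ReportConfigSketch` because it is not the library's `ReportConfig`, whose exact constructor and attribute layout are not shown on this page.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ReportConfigSketch:
    """Illustrative mirror of the parameter table; not the real class."""
    chunk_size: int = 200_000           # rows per chunk
    numeric_sample_size: int = 20_000   # reservoir size for quantiles
    uniques_sketch_size: int = 2_048    # KMV sketch size
    top_k_size: int = 50                # Misra-Gries capacity
    compute_correlations: bool = True   # enable/disable correlation chips
    corr_threshold: float = 0.5         # minimum |r| to display
    random_seed: Optional[int] = None   # set for deterministic sampling
    include_sample: bool = True         # include data sample in report


defaults = ReportConfigSketch()

# Override a few fields for a smaller, reproducible run.
small_run = replace(
    defaults, chunk_size=50_000, random_seed=42, compute_correlations=False
)
```

The sample and sketch sizes are what bound per-column memory, so lowering them trades accuracy of the approximate statistics for a smaller footprint.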
Security & Correctness
- HTML escaping: column names and labels are escaped before rendering
- Missing/Inf handling: NaN and ±Inf are excluded from moments and reported separately
- Approximation badges: estimates are marked with a (≈) or approx badge
- Reproducibility: set random_seed for deterministic results
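The first two points can be made concrete with a short standard-library sketch. The helper names `escape_label` and `split_finite` are hypothetical, and treating `None` the same as NaN is an assumption of this example rather than documented library behavior.

```python
import html
import math
from typing import Iterable, List, Tuple


def escape_label(name: object) -> str:
    """Escape a column name so it cannot inject markup into the report."""
    return html.escape(str(name), quote=True)


def split_finite(values: Iterable[object]) -> Tuple[List[float], int, int]:
    """Separate finite numbers from missing (None/NaN) and infinite values.

    Only the finite list would be fed to the moment accumulator; the two
    counts are reported alongside it.
    """
    finite: List[float] = []
    n_missing = 0
    n_inf = 0
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            n_missing += 1
        elif math.isinf(v):
            n_inf += 1
        else:
            finite.append(v)
    return finite, n_missing, n_inf
```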
Extending
- Backends — polars/Arrow/DuckDB can be connected via the chunk iterator interface
- Quantile sketches — t-digest or KLL can replace the default reservoir
- New sections — drift comparisons, JSON export, CLI wrapper