Architecture Diagrams

Visual reference for PySuricata's processing pipeline, accumulator internals, chunk processing, and rendering — annotated with algorithmic complexity.

1. Main Processing Pipeline

End-to-end data flow from user input to final Report object.

flowchart TD
    A["User Input\npd.DataFrame | pl.DataFrame\npl.LazyFrame | Iterable"] -->|"O(1)"| B["api.py — profile / summarize\n_coerce_input · _to_engine_config"]
    B -->|"O(1)"| C["report.py — ReportOrchestrator.build_report"]
    C -->|"O(1)"| D["engine.py — StreamingEngine.process_stream"]
    D -->|"O(1)"| E["EngineManager.select_adapter\nPandasAdapter | PolarsAdapter"]
    E -->|"O(1)"| F["AdaptiveChunker.chunks_from_source\nStrategy: ADAPTIVE | FIXED | MEMORY_AWARE"]
    F -->|"O(n/c) chunks"| G["First Chunk — infer_and_build\nO(cols): type inference + accumulator creation"]
    G --> K{"Stream Loop\nO(n/c) iterations"}
    K -->|"O(cols × chunk)"| L["consume_chunk\nPer-column accumulator updates"]
    K -->|"O(m² × chunk)"| M["update_corr\nPairwise sums for Pearson r\nm = numeric cols"]
    K -->|"O(cols × chunk)"| N["update_row_kmv\nRow hash for duplicate detection"]
    L --> K
    M --> K
    N --> K
    K -->|"Done"| O["_build_manifest_inputs\nO(cols): kinds_map · col_order · miss_list"]
    O --> P["_apply_correlation_chips\nO(cols): attach top correlations"]
    P --> Q["render_html_snapshot\nO(cols × card_render)"]
    Q --> R["_build_summary\nO(cols): acc.finalize per column"]
    R --> S["Report — html + stats"]
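
The `update_corr` step in the diagram keeps pairwise sums so that Pearson r can be computed without revisiting earlier chunks. A minimal sketch of that idea for a single pair of columns is shown here. The class name `PairwiseCorr` and its methods are hypothetical, not PySuricata's API, and the sketch assumes neither column contains missing values.

```python
import math


class PairwiseCorr:
    """Running sums for Pearson r between two columns (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, xs, ys):
        # One pass over the chunk: O(chunk) time, O(1) state.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

    def pearson(self):
        # r = cov(x, y) / (std(x) * std(y)), written in terms of the sums.
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        if var_x <= 0 or var_y <= 0:
            return float("nan")  # constant column: r is undefined
        return cov / math.sqrt(var_x * var_y)


corr = PairwiseCorr()
corr.update([1.0, 2.0, 3.0], [2.0, 4.0, 6.5])  # chunk 1
corr.update([4.0, 5.0], [8.0, 9.5])            # chunk 2
r = corr.pearson()
```

Keeping one such state per pair of numeric columns is what produces the m² factor in the correlation cost. Raw sums of squares can lose precision when values are large relative to their spread, so a production version would likely center the data first.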

Complexity Summary

Metric	Value
Total Time	O(n × cols) + O(n × m²) for correlations
Total Space	O(cols × (s + k + b))
Peak Memory	~50 MB bounded by streaming architecture

Where: n = total rows, c = chunk size (200k default), m = numeric columns, s = reservoir size (20k), k = KMV sketch size (2048), b = histogram bins (25).
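
As a rough worked example of these terms, the sketch below counts chunks and update operations for a given table shape. The function name `estimate_work` is hypothetical, and it assumes one correlation state per unordered pair of numeric columns, which the diagram does not confirm.

```python
import math


def estimate_work(n_rows, n_cols, n_numeric, chunk_size=200_000):
    """Rough operation counts for the pipeline (illustrative only)."""
    chunks = math.ceil(n_rows / chunk_size)      # O(n/c) loop iterations
    column_updates = n_rows * n_cols             # O(n x cols) accumulator work
    pairs = n_numeric * (n_numeric - 1) // 2     # assumed: unordered pairs
    corr_updates = n_rows * pairs                # O(n x m^2) correlation work
    return {
        "chunks": chunks,
        "column_updates": column_updates,
        "corr_updates": corr_updates,
    }


work = estimate_work(n_rows=1_000_000, n_cols=20, n_numeric=10)
# 5 chunks, 20 million column updates, 45 million pairwise updates
```

With only 10 numeric columns out of 20, the pairwise term already exceeds the per-column term, which is why the correlation step is reported separately.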

2. Accumulator Architecture

Each column type has a specialized accumulator with small, bounded state.

flowchart TD
    subgraph Numeric["NumericAccumulator — Space: O(s + k + b) per column"]
        SM["StreamingMoments — Welford/Chan\nTime: O(1) per value | Space: O(1)\nmean · variance · skewness · kurtosis"]
        RS["ReservoirSampler — s = 20 000\nTime: O(1) add | Space: O(s)\nquantiles · histogram source"]
        KMV_N["KMV Sketch — k = 2 048\nTime: O(log k) add | Space: O(k)\napprox distinct count"]
        ET["ExtremeTracker — k = 5 bounded heaps\nTime: O(log k) per value | Space: O(k)\nmin/max with row indices"]
        MG_N["MisraGries — k = 50\nTime: O(1) add | Space: O(k)\ntop-k values (integer-like cols)"]
        SH["StreamingHistogram — bins = 25\nTime: O(1) add | Space: O(b)\ntrue distribution approximation"]
        OD["OutlierDetector\nTime: O(s) finalize | Space: O(1)\nIQR + MAD methods"]
    end

    subgraph Cat["CategoricalAccumulator — Space: O(k) per column"]
        KMV_C["KMV Sketch\nTime: O(log k) add | Space: O(k)\napprox distinct count"]
        MG_C["MisraGries\nTime: O(1) add | Space: O(k)\ntop-k categories"]
        SL["String Length Tracker\nTime: O(1) per value | Space: O(1)\navg_len · p90 · empty count"]
    end

    subgraph DT["DatetimeAccumulator — Space: O(1) per column"]
        DMM["Min/Max — nanosecond timestamps\nTime: O(1) | Space: O(1)"]
        DFC["Frequency Counters\nday-of-week · hour · month\nTime: O(1) | Space: O(1)"]
    end

    subgraph Bool["BooleanAccumulator — Space: O(1) per column"]
        BCT["True / False / Missing counters\nTime: O(1) per value | Space: O(1)"]
    end
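
The Welford/Chan approach named in the StreamingMoments node can be sketched as follows: Welford's rule folds in one value at a time, and Chan's rule merges two partial results, for example one per chunk. This sketch covers mean and variance only (skewness and kurtosis need higher-order terms), and the class name `RunningMoments` is hypothetical.

```python
class RunningMoments:
    """Mean and variance in O(1) space (sketch; higher moments omitted)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        # Welford's single-value update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Chan et al. combination of two partial results.
        if other.n == 0:
            return
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n

    def variance(self):
        # Sample variance (ddof = 1); undefined for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


a, b = RunningMoments(), RunningMoments()
for x in [2.0, 4.0, 4.0, 4.0]:   # first chunk
    a.add(x)
for x in [5.0, 5.0, 7.0, 9.0]:   # second chunk
    b.add(x)
a.merge(b)                        # a now summarizes all eight values
```

The merged state matches what a single pass over all values would produce, so chunks can be summarized independently and combined afterwards.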

Per-Column Memory Budget

Accumulator	Default Config	Approx Memory
NumericAccumulator	s=20k, k=2048, b=25	~170 KB
CategoricalAccumulator	KMV k=2048, MisraGries k=50	~20 KB
DatetimeAccumulator	—	< 1 KB
BooleanAccumulator	—	< 1 KB
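
The numeric figure in this table can be reproduced with simple arithmetic. The sketch below assumes every stored entry is one 64-bit number and ignores Python object overhead; both are assumptions, and the function name is hypothetical.

```python
BYTES_PER_VALUE = 8  # assumption: each stored entry is one 64-bit number


def numeric_state_bytes(s=20_000, k=2_048, b=25):
    """Approximate bytes held by one numeric accumulator's bounded state."""
    reservoir = s * BYTES_PER_VALUE   # sampled values
    kmv = k * BYTES_PER_VALUE         # k smallest hash values
    histogram = b * BYTES_PER_VALUE   # one counter per bin
    return reservoir + kmv + histogram


per_column = numeric_state_bytes()
print(f"{per_column:,} bytes (~{per_column / 1024:.0f} KiB)")
# prints: 176,584 bytes (~172 KiB)
```

The reservoir accounts for roughly 90% of the total, so `s` is the parameter to lower first when profiling very wide tables.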

3. Chunk Processing Detail

What happens inside consume_chunk() for each column in a chunk.

flowchart TD
    A["Incoming Chunk\npd.DataFrame or pl.DataFrame"] --> B{"New columns\nin this chunk?"}
    B -->|"Yes"| C["UnifiedTypeInferrer.infer_series_type\n+ create accumulator via factory\nO(sample_size) per new column"]
    B -->|"No"| D["Iterate accs.items — O(cols)"]
    C --> D

    D --> E{"Column type?"}

    E -->|"Numeric"| F["to_numpy(float64)\nFast path: direct cast\nSlow path: pd.to_numeric(errors='coerce')\nO(chunk_size)"]
    F --> F1["acc.update(arr)\nmoments + reservoir + histogram\nO(chunk_size)"]
    F1 --> F2["KMV.add per value\nO(chunk_size × log k)"]
    F2 --> F3["ExtremeTracker.update\nO(chunk_size × log k)\nEvery 5th chunk only (pandas)"]
    F3 --> G["Next column"]

    E -->|"Categorical"| H["s.tolist → Python list\nO(chunk_size)"]
    H --> H1["KMV + MisraGries + string stats\nO(chunk_size × log k)"]
    H1 --> G

    E -->|"Boolean"| I["Per-value str coercion\nstr(v).strip().lower()\nO(chunk_size) — Python loop"]
    I --> I1["True/False/Missing counting\nO(chunk_size)"]
    I1 --> G

    E -->|"Datetime"| J["pd.to_datetime(errors='coerce', utc=True)\nO(chunk_size)"]
    J --> J1["Min/max + frequency counters\nO(chunk_size)"]
    J1 --> G

    G --> K{"More columns?"}
    K -->|"Yes"| D
    K -->|"No"| L["Return to engine loop\n+ update memory estimate"]
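
The Boolean branch of the diagram normalizes each value with `str(v).strip().lower()` and then counts it. A minimal sketch of that loop is given here. The token sets and the choice to count unrecognized tokens as missing are assumptions made for illustration, not PySuricata's documented behavior.

```python
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}    # assumed vocabulary
FALSE_TOKENS = {"false", "f", "no", "n", "0"}   # assumed vocabulary


def count_booleans(values):
    """Count true / false / missing values in one chunk with a Python loop."""
    counts = {"true": 0, "false": 0, "missing": 0}
    for v in values:
        if v is None:
            counts["missing"] += 1
            continue
        token = str(v).strip().lower()
        if token in TRUE_TOKENS:
            counts["true"] += 1
        elif token in FALSE_TOKENS:
            counts["false"] += 1
        else:
            # Assumption: anything unrecognized is treated as missing.
            counts["missing"] += 1
    return counts


counts = count_booleans([True, " YES ", 0, "no", None, "maybe"])
# counts == {"true": 2, "false": 2, "missing": 2}
```

The work is linear in the chunk size, but every value passes through interpreted Python code, which is why this path shows up as a per-chunk bottleneck.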

Per-Chunk Bottleneck Analysis

Operation	Complexity	Notes
Numeric KMV loop	O(chunk × log k)	Per-value Python loop; a candidate for batch vectorization
Boolean coercion	O(chunk)	Per-value str conversion in a Python loop; a candidate for vectorization
Correlation update	O(m² × chunk)	Pairwise; gated by `corr_max_cols=50`
Row KMV hashing	O(chunk × cols)	Per-row tuple hashing; a candidate for vectorization
Type conversion	O(chunk)	pd.to_numeric / pd.to_datetime
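
The O(log k) per-value cost of the KMV rows comes from maintaining a bounded heap of the k smallest hash values. The sketch below illustrates the technique and applies it to row tuples for approximate duplicate counting. The class name `KMVSketch` and the choice of BLAKE2b over `repr(value)` as the hash are assumptions made for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Keep the k smallest distinct hash values seen (illustrative sketch)."""

    def __init__(self, k=2048):
        self.k = k
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # hashes currently kept, to skip repeats

    @staticmethod
    def _hash(value):
        # Stable 64-bit hash mapped to [0, 1).
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2 ** 64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)                # O(log k)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)  # O(log k)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)     # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


sketch = KMVSketch(k=1024)
rows = [(i % 500, "x") for i in range(5_000)]  # 500 distinct rows, repeated
for row in rows:
    sketch.add(row)
approx_duplicates = len(rows) - sketch.estimate()
```

Below k distinct values the count is exact; above it, the relative error shrinks roughly as 1/sqrt(k), which is the trade-off behind choosing a sketch size such as 2048.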

4. Rendering Pipeline

How the final HTML report is assembled from accumulated statistics.

flowchart TD
    A["render_html_snapshot\nrender/html.py"] --> B["Build kinds_map\nO(cols): name → (kind, accumulator)"]
    B --> C["Compute miss_list\nO(cols): per-column missing percentage"]
    C --> D["Dataset metrics\nrow_kmv.approx_duplicates\nconstant / high-cardinality detection"]
    D --> E["Card Loop — O(cols)"]
    E --> F["acc.finalize(chunk_metadata)\nquantiles O(s log s)\nextremes O(k log k)\noutlier stats O(s)"]
    F --> G{"Card type?"}

    G -->|"Numeric"| G1["NumericCardRenderer\nSVG histogram + stats tables\n~200 lines HTML per card"]
    G -->|"Categorical"| G2["CategoricalCardRenderer\nDonut chart SVG + top-k table"]
    G -->|"Datetime"| G3["DatetimeCardRenderer\nTemporal bar charts + freq tables"]
    G -->|"Boolean"| G4["BooleanCardRenderer\nTrue/False bar + percentage"]

    G1 --> H["Collect card HTML strings"]
    G2 --> H
    G3 --> H
    G4 --> H

    H --> I["Load + inline static assets\nstyle.css · functionality.js\nchart.min.js (inlined)"]
    I --> J["CorrelationsSectionRenderer\nrender_section — O(m²)"]
    J --> K_node["MissingValuesSectionRenderer\nrender_section — O(cols)"]
    K_node --> L["DonutChartRenderer\nrender_dtype_donut — SVG"]
    L --> M["Template assembly\nreport_template.html.format\n~40 template variables"]
    M --> N["Self-contained HTML\n~1.2–1.6 MB typical"]
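
The last two steps of the diagram, asset inlining and `str.format`-based template assembly, can be sketched as below. The template text, the three placeholder names, and the function `assemble_report` are hypothetical stand-ins; the real `report_template.html` is described as having about 40 variables.

```python
TEMPLATE = """<!DOCTYPE html>
<html>
<head><style>{css}</style></head>
<body>
<main>{cards}</main>
<script>{js}</script>
</body>
</html>"""


def assemble_report(card_html, css, js):
    """Join per-column cards and inline the assets into one HTML string."""
    return TEMPLATE.format(css=css, js=js, cards="\n".join(card_html))


html = assemble_report(
    card_html=["<section>age</section>", "<section>city</section>"],
    css="body { margin: 0; }",
    js="console.log('ready');",
)
```

Braces inside the substituted CSS or JavaScript are safe because `str.format` does not re-parse inserted values. Literal braces in the template itself would need to be doubled.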

Rendering Cost Breakdown

Phase	Complexity	Output Size
Finalization	O(cols × s log s)	—
Card rendering	O(cols)	~5–20 KB per card
Asset inlining	O(1)	~200 KB (CSS+JS)
Correlation section	O(m²)	Variable
Template assembly	O(1)	~1.2–1.6 MB total
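
The O(s log s) finalization cost comes from sorting each column's reservoir sample once to read off quantiles. A minimal sketch of that pairing is shown here, using the classic Algorithm R for the sampling step. The class name `Reservoir`, the fixed seed, and the nearest-rank quantile rule are assumptions made for illustration.

```python
import random


class Reservoir:
    """Uniform fixed-size sample of a stream (Algorithm R sketch)."""

    def __init__(self, size=20_000, seed=0):
        self.size = size
        self.seen = 0
        self.items = []
        self._rng = random.Random(seed)

    def add(self, x):
        self.seen += 1
        if len(self.items) < self.size:
            self.items.append(x)
        else:
            j = self._rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.size:
                self.items[j] = x               # kept with prob. size / seen

    def quantile(self, q):
        # Finalization: sort the sample once, O(s log s), then index into it.
        if not self.items:
            return float("nan")
        ordered = sorted(self.items)
        return ordered[round(q * (len(ordered) - 1))]


res = Reservoir(size=1_000)
for v in range(1, 102):      # 101 values fit entirely in the sample
    res.add(v)
median = res.quantile(0.5)   # exact here because nothing was evicted
```

Once the stream is longer than the sample size, quantiles become estimates whose accuracy depends on `size` rather than on the number of rows.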

5. End-to-End Complexity Table

Stage	Time	Space	Key Parameter
Input coercion	O(1)	O(1)	—
Chunking	O(n/c)	O(c × cols)	`chunk_size` (200k)
Type inference	O(cols × sample)	O(1)	first chunk, plus any chunk that introduces new columns
Accumulator updates	O(n × cols)	O(cols × s)	`numeric_sample_k` (20k)
KMV sketching	O(n × cols × log k)	O(cols × k)	`uniques_k` (2048)
Correlation	O(n × m²)	O(m²)	`corr_max_cols` (50)
Row deduplication	O(n × cols)	O(k)	KMV sketch
Finalization	O(cols × s log s)	O(cols)	—
HTML rendering	O(cols)	O(report_size)	~1.2–1.6 MB output
Total	O(n × cols + n × m²)	O(cols × (s + k + b))	—