Complexity Analysis
Overview
PySuricata processes data in a single pass using streaming algorithms. All statistics are computed incrementally from small, fixed-size state — memory usage depends on configuration parameters, not dataset size.
Key property: For a dataset with n rows and p columns, PySuricata uses O(p · k) memory where k is the sketch/sample size (constant), making total memory independent of n.
Summary
| Component | Time per value | Space | Dominant parameter |
|---|---|---|---|
| Streaming moments | O(1) | O(1) | — |
| Reservoir sampling | O(1) | O(s) | numeric_sample_size (default: 20,000) |
| KMV distinct count | O(log k) | O(k) | uniques_sketch_size (default: 2,048) |
| Misra-Gries top-k | O(1) amortized | O(k) | top_k_size (default: 50) |
| Extreme tracking | O(log k) | O(k) | max_extremes (default: 5) |
| Correlations | O(p) per value | O(p²) | Number of numeric columns |
Total per numeric column: O(s + k) ≈ 22 KB with defaults
Total per categorical column: O(k) ≈ 18 KB with defaults
DateTime and Boolean columns: O(1)
Algorithms
Streaming Moments (Welford/Pébay)
Purpose: Exact mean, variance, skewness, kurtosis in a single pass.
- Update: O(1) per value — maintains running sums of powers of deviations
- Merge: O(1) — Pébay's formulas combine two partial states exactly
- Finalize: O(1)
- Space: O(1) — six floating-point accumulators (n, μ, M₂, M₃, M₄)
This is numerically stable — it avoids the catastrophic cancellation of naïve Σx² − nμ² formulas.
Reference: Welford (1962), Pébay (2008)
Reservoir Sampling
Purpose: Maintain a uniform random sample for quantile estimation and histograms.
- Update: O(1) per value — include with probability s/i, replacing a random element
- Merge: O(s) — weighted combination of two reservoirs
- Finalize: O(s log s) — sort for quantile computation
- Space: O(s) where s =
numeric_sample_size
Every subset of size s from the stream has equal probability of being the sample.
Reference: Vitter (1985)
KMV (K-Minimum Values) Sketch
Purpose: Approximate distinct count estimation.
- Update: O(log k) — hash value, maintain min-heap of k smallest hashes
- Merge: O(k log k) — union two heaps, keep k smallest
- Estimate: d̂ = (k − 1) / x_k where x_k is the k-th smallest hash
- Space: O(k) where k =
uniques_sketch_size - Error: ~1/√k relative error — with k=2048, approximately 2.2%
The implementation transitions from exact counting (for low-cardinality columns) to approximate estimation when the number of distinct values exceeds a threshold.
Reference: Bar-Yossef et al. (2002)
Misra-Gries (Top-K Frequent Values)
Purpose: Find the most frequent values in a stream.
- Update: O(1) amortized — increment counter or decrement all
- Merge: O(k) — sum counters from two states
- Finalize: O(k log k) — sort by frequency
- Space: O(k) where k =
top_k_size - Guarantee: Any value with true frequency > n/k is included in the output
Frequency estimates are within ±n/k of the true count.
Reference: Misra & Gries (1982)
Extreme Tracking
Purpose: Track the minimum and maximum values with their row indices.
- Update: O(log k) per value — bounded heap operations
- Merge: O(k log k) — merge and truncate heaps
- Space: O(k) where k =
max_extremes(default: 5)
Uses a min-heap for minimums and a negated max-heap for maximums.
Correlations
Purpose: Streaming Pearson correlation between all pairs of numeric columns.
- Update: O(p) per value — update co-moment for each column pair
- Finalize: O(p²) — compute correlations from co-moments
- Space: O(p²) — stores pairwise co-moments
This is the most expensive component. For datasets with many numeric columns, disable with compute_correlations = False or increase corr_threshold.
Accumulator Breakdown
NumericAccumulator
| Component | Space |
|---|---|
| StreamingMoments | O(1) |
| ReservoirSampler | O(s) |
| KMV sketch | O(k) |
| ExtremeTracker | O(k) |
| MisraGries | O(k) |
| Histogram bins | O(b) |
| Total | O(s + k + b) |
With defaults: s=20,000, k=2,048, b=25
CategoricalAccumulator
| Component | Space |
|---|---|
| KMV sketch (original) | O(k) |
| KMV sketch (lowercase) | O(k) |
| KMV sketch (trimmed) | O(k) |
| MisraGries | O(k) |
| String length tracking | O(1) |
| Total | O(k) |
DatetimeAccumulator
Fixed-size counters: hour (24), weekday (7), month (12), plus min/max tracking.
Total: O(1)
BooleanAccumulator
Two counters (true, false) plus missing count.
Total: O(1)
Configuration Impact
| Parameter | Effect on memory | Effect on accuracy |
|---|---|---|
chunk_size ↑ |
More rows in memory per iteration | No effect on final accuracy |
numeric_sample_size ↑ |
Larger reservoir per numeric column | Better quantile estimates |
uniques_sketch_size ↑ |
Larger KMV sketch per column | Better distinct count accuracy |
top_k_size ↑ |
More frequent values tracked | More comprehensive top-k |
compute_correlations off |
Saves O(p²) | No correlations reported |
Performance Optimizations (v0.0.14)
Several hot paths were vectorized to reduce per-value Python overhead:
| Optimization | Change |
|---|---|
| KMV/MisraGries updates | Per-value add() → batch add_many() with pre-counted values |
| Missing/Inf counting | Python loop → vectorized np.isnan() / np.isinf() |
| Row hashing (duplicates) | Per-row tuple + hash() → vectorized polynomial hash via numpy |
| Boolean coercion | Per-value str().lower() → pandas vectorized str.strip().str.lower() |
| Correlation finite masks | Recomputed per pair → pre-computed once per column, reused |
References
- Welford, B.P. (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics, 4(3): 419–420
- Pébay, P. (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments", Sandia Report SAND2008-6212
- Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM
- Misra, J. & Gries, D. (1982), "Finding repeated elements", Science of Computer Programming, 2(2): 143–152
- Vitter, J.S. (1985), "Random Sampling with a Reservoir", ACM TOMS, 11(1): 37–57