Statistical Methods Overview
PySuricata analyzes four variable types (numeric, categorical, datetime, and boolean), each with its own specialized algorithms.
Analysis by Variable Type
Numeric Variables
Exact statistics using Welford/Pébay streaming algorithms:
- Mean, variance, standard deviation
- Skewness, kurtosis
- Min, max, range
Approximate statistics using probabilistic data structures:
- Quantiles (reservoir sampling)
- Distinct count (KMV sketch)
- Histograms (adaptive binning)
Key formulas:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \]
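The mean and sample variance above can be accumulated in a single pass with Welford's update. The following is a minimal sketch, not PySuricata's actual implementation; the class name `RunningStats` and its fields are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical accumulator illustrating Welford's online update."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean            # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    @property
    def variance(self) -> float:
        # Sample variance s^2 with the n - 1 denominator.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    @property
    def std(self) -> float:
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.mean, stats.variance)  # mean is 5, variance is 32/7 up to rounding
```

Because each value is seen once and only three numbers are stored, memory use does not grow with the number of rows.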
Categorical Variables
Analysis includes:
- Top-k values (Misra-Gries algorithm)
- Distinct count (KMV sketch)
- Entropy and Gini impurity
- String statistics
Key formulas:
\[ H(X) = -\sum_{x} p(x) \log_2 p(x), \quad \text{Gini}(X) = 1 - \sum_{x} p(x)^2 \]
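As a worked illustration of the two formulas, the sketch below computes entropy and Gini impurity from exact value counts. It assumes the full column fits in memory, which a streaming profiler would not; the function name `entropy_and_gini` is hypothetical.

```python
import math
from collections import Counter


def entropy_and_gini(values):
    """Return (entropy in bits, Gini impurity) from exact value counts."""
    counts = Counter(values)
    n = sum(counts.values())
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    gini = 1.0 - sum(p * p for p in probs)
    return entropy, gini


# p = (0.5, 0.25, 0.25): H = 1.5 bits, Gini = 1 - 0.375 = 0.625
print(entropy_and_gini(["a", "a", "b", "c"]))
```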
DateTime Variables
Temporal analysis:
- Hour, day-of-week, month distributions
- Monotonicity detection
- Timeline visualizations
Key formulas:
\[ \Delta t = \max(t) - \min(t), \quad M = \frac{n_{\text{increasing}}}{n-1} \]
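The sketch below evaluates both quantities for a small list of timestamps. It assumes that n_increasing counts consecutive pairs in which the later row has a strictly later timestamp; the source does not say whether ties count, so treat that choice, and the function name `span_and_monotonicity`, as assumptions.

```python
from datetime import datetime


def span_and_monotonicity(timestamps):
    """Return (time span, fraction of consecutive pairs that increase)."""
    span = max(timestamps) - min(timestamps)
    pairs = list(zip(timestamps, timestamps[1:]))  # n - 1 consecutive pairs
    increasing = sum(1 for earlier, later in pairs if later > earlier)
    ratio = increasing / len(pairs) if pairs else float("nan")
    return span, ratio


ts = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 2),
    datetime(2024, 1, 4),
    datetime(2024, 1, 3),  # out of order
    datetime(2024, 1, 5),
]
span, ratio = span_and_monotonicity(ts)
print(span.days, ratio)  # 4 0.75
```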
Boolean Variables
Binary analysis:
- True/false counts and ratios
- Entropy calculation
- Imbalance detection
Key formulas:
\[ H = -p \log_2(p) - (1-p)\log_2(1-p) \]
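A short sketch of the binary entropy formula, where p is taken to be the fraction of true values. The limit cases p = 0 and p = 1 are handled explicitly because log2(0) is undefined; the function name is hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a boolean column whose fraction of True values is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a constant column carries no information
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


true_count, total = 90, 100
print(binary_entropy(true_count / total))  # well below 1 bit: imbalanced
print(binary_entropy(0.5))                 # 1.0 bit: perfectly balanced
```

Entropy near 1 bit indicates a balanced column, while values near 0 indicate strong imbalance.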
Advanced Analytics
Correlations
Streaming Pearson correlation between numeric columns.
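One common way to compute Pearson correlation in a single pass is to extend Welford's update with a running co-moment. The sketch below shows that approach for one pair of columns; it is an assumption about the general technique, not a description of PySuricata's internals, and `RunningPearson` is a hypothetical name.

```python
import math


class RunningPearson:
    """Hypothetical single-pass Pearson accumulator for one column pair."""

    def __init__(self) -> None:
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # sum of squared deviations of x
        self.m2_y = 0.0  # sum of squared deviations of y
        self.c_xy = 0.0  # sum of co-deviations of x and y

    def update(self, x: float, y: float) -> None:
        self.n += 1
        dx = x - self.mean_x  # deviations from the old means
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)  # old-mean dx times new-mean dy

    @property
    def correlation(self) -> float:
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else float("nan")


acc = RunningPearson()
for x in range(10):
    acc.update(float(x), 2.0 * x + 1.0)  # perfectly linear relationship
print(acc.correlation)  # close to 1.0
```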
Missing Values
Missing-data analysis with chunk-level tracking of missing values.
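To make chunk-level tracking concrete, the sketch below counts missing values per chunk and overall for a single column. Treating `None` and float NaN as missing, and representing chunks as plain lists, are assumptions made for illustration; the helper names are hypothetical.

```python
import math


def is_missing(value) -> bool:
    """Assumed definition: None or a float NaN counts as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_by_chunk(chunks):
    """Return ([(rows, missing) per chunk], overall missing ratio)."""
    per_chunk = []
    total_rows = 0
    total_missing = 0
    for chunk in chunks:
        missing = sum(1 for value in chunk if is_missing(value))
        per_chunk.append((len(chunk), missing))
        total_rows += len(chunk)
        total_missing += missing
    overall = total_missing / total_rows if total_rows else float("nan")
    return per_chunk, overall


chunks = [[1.0, None, 3.0], [float("nan"), None], [4.0, 5.0, 6.0]]
print(missing_by_chunk(chunks))  # ([(3, 1), (2, 2), (3, 0)], 0.375)
```

Keeping the per-chunk counts makes it possible to see where in the data missing values are concentrated, not only how many there are.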
Algorithms
Streaming Statistics
- Welford's algorithm: Online mean/variance
- Pébay's formulas: Parallel merging
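Parallel merging means that statistics computed on separate chunks can be combined without revisiting the data. The sketch below shows the second-order case (count, mean, sum of squared deviations), a pairwise update usually attributed to Chan et al. that Pébay's formulas extend to higher-order moments such as skewness and kurtosis. The tuple layout and function names are hypothetical.

```python
def summarize(values):
    """Return (n, mean, m2) for one chunk; m2 is the sum of squared deviations."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def merge(a, b):
    """Combine two (n, mean, m2) summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


left = summarize([2.0, 4.0, 4.0, 4.0])
right = summarize([5.0, 5.0, 7.0, 9.0])
n, mean, m2 = merge(left, right)
print(n, mean, m2 / (n - 1))  # matches a single pass over all eight values
```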
Sketch Algorithms
- KMV: Distinct count estimation
- Misra-Gries: Top-k heavy hitters
- Reservoir sampling: Uniform sampling
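Two of the sketches listed above are short enough to show in full. The code below is a minimal illustration under stated assumptions: `misra_gries` keeps at most k - 1 counters so that every item with frequency above n/k is retained, and `reservoir_sample` is the classic Algorithm R. Function names, parameters, and the fixed seed are hypothetical.

```python
import random


def misra_gries(stream, k):
    """Heavy-hitter candidates using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # stored counts are lower bounds on true frequencies


def reservoir_sample(stream, size, rng):
    """Uniform sample of `size` items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < size:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < size:
                reservoir[j] = item
    return reservoir


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]  # n = 12
print(misra_gries(stream, k=3))  # "a" (6 > 12/3) is guaranteed to appear

sample = reservoir_sample(range(100_000), size=1_000, rng=random.Random(0))
approx_median = sorted(sample)[len(sample) // 2]  # estimate, not exact
print(approx_median)
```

Quantiles read from the sorted reservoir are estimates whose accuracy depends on the sample size, which is why they are approximate even though the sample itself is uniform.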
Guarantees
Method | Type | Error |
---|---|---|
Mean, variance | Exact | Machine precision |
Skewness, kurtosis | Exact | Machine precision |
Distinct (KMV) | Approximate | ~2% (k=2048) |
Top-k (Misra-Gries) | Approximate (deterministic bound) | All items with frequency > n/k found |
Quantiles (reservoir) | Approximate | Sampling error; computed from a uniform sample |
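The KMV row can be illustrated with a small sketch. A KMV (k minimum values) sketch hashes each value to the unit interval, keeps the k smallest distinct hashes, and estimates the distinct count as (k - 1) divided by the k-th smallest hash. The relative standard error of that estimator is roughly 1/sqrt(k - 2), about 2.2% for k = 2048, which is consistent with the figure in the table. The hash choice (BLAKE2b over `repr`) and the function names are assumptions made for illustration.

```python
import hashlib
import heapq
import math


def hash_unit(value) -> float:
    """Map a value to a pseudo-uniform float in (0, 1]."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64


def kmv_estimate(stream, k=2048) -> float:
    """Estimate the number of distinct values using the k smallest hashes."""
    heap = []        # negated hashes, so heap[0] is the largest retained hash
    members = set()  # retained hashes, for duplicate detection
    for value in stream:
        h = hash_unit(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes were seen
    return (k - 1) / -heap[0]


def relative_standard_error(k: int) -> float:
    return 1.0 / math.sqrt(k - 2)


values = [i % 50_000 for i in range(100_000)]  # 50,000 distinct values
print(kmv_estimate(values), relative_standard_error(2048))
```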
See Also
- Numeric Analysis - Complete numeric documentation
- Categorical Analysis - Categorical methods
- DateTime Analysis - Temporal analysis
- Boolean Analysis - Binary variables
- Streaming Algorithms - Algorithm details
- Sketch Algorithms - Probabilistic structures
Last updated: 2025-10-12