PySuricata
Exploratory data analysis for Python, built on streaming algorithms.
PySuricata generates self-contained HTML reports for pandas and polars DataFrames. It processes data in chunks using streaming algorithms, so memory usage stays bounded regardless of dataset size.
- Quick Start: Install PySuricata and generate your first report.
- Why PySuricata? Understand the streaming architecture and design decisions.
- User Guide: Detailed guides for configuration, advanced features, and more.
- API Reference: Full API documentation generated from source code.
Features
- Streaming processing: Data is processed in configurable chunks, keeping memory usage bounded. Useful for datasets that don't fit in RAM.
- Mathematically grounded: Uses Welford's algorithm for numerically stable moments, Pébay's formulas for mergeable statistics, KMV sketches for distinct count estimation, and Misra-Gries for heavy hitters.
- Pandas and Polars support: Works natively with `pandas.DataFrame` and `polars.DataFrame`/`polars.LazyFrame`.
- Self-contained reports: Generates a single HTML file with inline CSS, JS, and SVG charts. No external assets or dependencies are needed to view it.
- Configurable: Control chunk sizes, sample sizes, sketch parameters, correlation thresholds, and rendering options via `ReportConfig`.
- Reproducible: Seeded random sampling produces deterministic results across runs.
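To make the "numerically stable moments" and "mergeable statistics" items concrete, here is a minimal sketch of Welford's update and the pairwise merge that Pébay's formulas generalise to higher moments. The class `RunningMoments` and helper `moments_of` are hypothetical names for illustration only; they are not PySuricata's internal API, and the sketch covers mean and variance but not skewness or kurtosis.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Hypothetical accumulator: count, mean, and sum of squared deviations (M2)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: one value at a time, no history is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningMoments") -> "RunningMoments":
        # Pairwise merge of two accumulators built on disjoint chunks.
        n = self.n + other.n
        if n == 0:
            return RunningMoments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningMoments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Sample variance (ddof=1); undefined for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def moments_of(chunk) -> RunningMoments:
    """Fold one chunk of numbers into a fresh accumulator."""
    acc = RunningMoments()
    for x in chunk:
        acc.update(x)
    return acc


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Process two chunks independently, then combine their accumulators.
merged = moments_of(data[:3]).merge(moments_of(data[3:]))
print(merged.mean, merged.variance, statistics.variance(data))
```

The merged accumulator should agree with a batch computation over the full list up to floating-point rounding, which is what lets chunks be processed independently.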
Installation
Installing PySuricata also pulls in its dependencies: pandas, numpy (on Python ≥3.13), markdown, and psutil.
Polars support requires an additional, optional install step.
Quick Example
```python
import pandas as pd
from pysuricata import profile

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Generate report
report = profile(df)
report.save_html("titanic_report.html")
```
The code above produces a report for the Titanic dataset (891 rows × 12 columns).
How It Works
PySuricata reads data in chunks and updates lightweight accumulators for each column. This means:
| Aspect | Approach |
|---|---|
| Memory | Bounded by chunk size + accumulator state, not dataset size |
| Speed | Single pass over the data: each row is read once |
| Accuracy | Exact up to floating-point rounding for moments (mean, variance, skewness, kurtosis); approximate with known error bounds for distinct counts and top-k |
| Mergeability | Accumulators can be merged across chunks or machines |
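The distinct-count row in the table above can be illustrated with a small K-Minimum-Values (KMV) sketch. This is a minimal illustration under stated assumptions, not PySuricata's implementation: the class name `KMVSketch`, the choice of BLAKE2b as the hash, and hashing values through `repr` are all choices made for this example. The sketch keeps only the `k` smallest hash values, so its memory is bounded by `k`, and two sketches can be merged by keeping the `k` smallest hashes of their union.

```python
import hashlib
import heapq


class KMVSketch:
    """Hypothetical K-Minimum-Values sketch for approximate distinct counts."""

    def __init__(self, k: int = 256) -> None:
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(value) -> float:
        # Map a value to a pseudo-uniform float in [0, 1] (assumption: repr-based hashing).
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def add(self, value) -> None:
        h = self._hash(value)
        if h in self._members:
            return  # already tracked, duplicates do not change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        # Keep the k smallest hashes of the union (both sketches must share k).
        merged = KMVSketch(self.k)
        for h in sorted(self._members | other._members)[: self.k]:
            heapq.heappush(merged._heap, -h)
            merged._members.add(h)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact count
        return (self.k - 1) / -self._heap[0]  # (k - 1) / k-th smallest hash


sketch = KMVSketch(k=256)
for user_id in range(10_000):
    sketch.add(user_id)
    sketch.add(user_id)  # duplicates are ignored
print(round(sketch.estimate()))
```

With `k = 256` the relative standard error is roughly `1 / sqrt(k - 2)`, or about 6%; a larger `k` trades memory for accuracy.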
Reports include per-column statistics, histograms, correlation chips, missing value analysis, outlier detection, and more — all computed during the single streaming pass.
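As an illustration of how a top-k summary can be computed within the same single pass, here is a minimal sketch of the Misra-Gries heavy-hitters algorithm. The function name `misra_gries` and the sample stream are hypothetical and chosen for this example; PySuricata's own implementation and parameters may differ.

```python
import random


def misra_gries(stream, k: int) -> dict:
    """Return candidate heavy hitters using at most k - 1 counters.

    Every item occurring more than n / k times in a stream of length n
    is guaranteed to be among the returned keys; each returned count
    undercounts the true frequency by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Two frequent values mixed with 20 values that each appear once.
stream = ["a"] * 50 + ["b"] * 30 + [f"rare-{i}" for i in range(20)]
random.Random(0).shuffle(stream)  # seeded, so the order is reproducible

candidates = misra_gries(stream, k=4)
print(sorted(candidates, key=candidates.get, reverse=True))
```

The returned counts are lower bounds, so an exact ranking would need a second pass over the candidates; in a strictly single-pass setting the counts are reported as estimates.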
Next Steps
- New to PySuricata? Start with the Quick Start Guide.
- Want specific examples? Check the Examples Gallery.
- Interested in the algorithms? Explore Statistical Methods.
- Want to contribute? Read the Contributing Guide.
License
MIT License. See LICENSE for details.