PySuricata

Exploratory data analysis for Python, built on streaming algorithms.

PySuricata generates self-contained HTML reports for pandas and polars DataFrames. It processes data in chunks using streaming algorithms, so memory usage stays bounded regardless of dataset size.

Quick Start

Install PySuricata and generate your first report.

Why PySuricata?

Understand the streaming architecture and design decisions.

User Guide

Detailed guides for configuration, advanced features, and more.

API Reference

Full API documentation generated from source code.


Features

Streaming processing — Data is processed in configurable chunks, keeping memory usage bounded. Useful for datasets that don't fit in RAM.
Mathematically grounded — Uses Welford's algorithm for numerically stable moments, Pébay's formulas for mergeable statistics, KMV sketches for distinct count estimation, and Misra-Gries for heavy hitters.
Pandas and Polars support — Works natively with both pandas.DataFrame and polars.DataFrame / polars.LazyFrame.
Self-contained reports — Generates a single HTML file with inline CSS, JS, and SVG charts. No external assets or dependencies needed to view.
Configurable — Control chunk sizes, sample sizes, sketch parameters, correlation thresholds, and rendering options via ReportConfig.
Reproducible — Seeded random sampling produces deterministic results across runs.
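To illustrate the reproducibility point, here is a minimal sketch of seeded sampling from a stream using reservoir sampling (Algorithm R). It is an assumption that this resembles what PySuricata does internally; the function name reservoir_sample and its parameters are hypothetical and exist only for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream (Algorithm R)."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical usage: the same seed gives the same sample on every run.
first = reservoir_sample(range(10_000), k=5, seed=42)
second = reservoir_sample(range(10_000), k=5, seed=42)
print(first == second)  # True
```

Only k items are held in memory at any time, which is why this kind of sampling fits a chunked, single-pass design.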

Installation

uv (recommended):

uv add pysuricata

pip:

pip install pysuricata

This installs PySuricata along with its dependencies: pandas, numpy (on Python ≥3.13), markdown, and psutil.

To also install polars support:

pip install "pysuricata[polars]"

Quick Example

import pandas as pd
from pysuricata import profile

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Generate report
report = profile(df)
report.save_html("titanic_report.html")

Running the code above on the Titanic dataset (891 rows × 12 columns) writes the report to titanic_report.html.

How It Works

PySuricata reads data in chunks and updates lightweight accumulators for each column. This means:

Aspect	Approach
Memory	Bounded by chunk size + accumulator state, not dataset size
Speed	Single pass over the data — each row is read once
Accuracy	Exact for moments (mean, variance, skewness, kurtosis); approximate with known error bounds for distinct counts and top-k
Mergeability	Accumulators can be merged across chunks or machines
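As a rough illustration of the exact, mergeable accumulators in the table above, the following sketch keeps a running mean and variance with Welford's update and combines per-chunk accumulators with the pairwise merge formula (the second-order case of the formulas that Pébay generalised to higher moments). The class name RunningMoments and its methods are assumptions made for this example and are not taken from PySuricata's source.

```python
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Streaming count, mean, and sum of squared deviations (M2)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: one value at a time, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningMoments") -> "RunningMoments":
        # Pairwise combination of two independent accumulators.
        if other.n == 0:
            return RunningMoments(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningMoments(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningMoments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Hypothetical usage: each chunk gets its own accumulator, then all are merged.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = RunningMoments()
for chunk in chunks:
    part = RunningMoments()
    for value in chunk:
        part.update(value)
    total = total.merge(part)

print(total.n, total.mean, total.variance)
```

Because merge only needs the three numbers held by each accumulator, the per-chunk work could run on separate workers and be combined afterwards.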

Reports include per-column statistics, histograms, correlation chips, missing value analysis, outlier detection, and more — all computed during the single streaming pass.
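The approximate side of the single pass can be sketched with the Misra-Gries heavy hitters algorithm. This is a minimal version under stated assumptions: the function name misra_gries and the parameter k are hypothetical, and PySuricata's own implementation may differ. With k - 1 counters over a stream of n items, each reported count underestimates the true frequency by at most n / k.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return candidate heavy hitters using at most k - 1 counters."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical usage: "a" occurs more than n / k times, so it must survive.
stream = ["a"] * 6 + ["b"] * 4 + ["c", "d", "e"]
k = 3
candidates = misra_gries(stream, k)
print(candidates)
```

The counts are lower bounds, so a second pass (or an exact counter on the surviving candidates) is needed when exact frequencies matter.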
