PySuricata

Exploratory data analysis for Python, built on streaming algorithms.

PySuricata generates self-contained HTML reports for pandas and polars DataFrames. It processes data in chunks using streaming algorithms, so memory usage stays bounded regardless of dataset size.

  • Quick Start
    Install PySuricata and generate your first report.

  • Why PySuricata?
    Understand the streaming architecture and design decisions.

  • User Guide
    Detailed guides for configuration, advanced features, and more.

  • API Reference
    Full API documentation generated from source code.

Features

  • Streaming processing — Data is processed in configurable chunks, keeping memory usage bounded. Useful for datasets that don't fit in RAM.
  • Mathematically grounded — Uses Welford's algorithm for numerically stable moments, Pébay's formulas for mergeable statistics, KMV sketches for distinct count estimation, and Misra-Gries for heavy hitters.
  • Pandas and Polars support — Works natively with both pandas.DataFrame and polars.DataFrame / polars.LazyFrame.
  • Self-contained reports — Generates a single HTML file with inline CSS, JS, and SVG charts. No external assets or dependencies needed to view.
  • Configurable — Control chunk sizes, sample sizes, sketch parameters, correlation thresholds, and rendering options via ReportConfig.
  • Reproducible — Seeded random sampling produces deterministic results across runs.
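
To make the "mergeable statistics" idea concrete, here is a minimal sketch of a Welford-style accumulator with a pairwise merge step in the spirit of Pébay's formulas. It covers only mean and variance, and the class and function names (`RunningMoments`, `moments_of`) are illustrative assumptions, not PySuricata's internal API.

```python
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Streaming count, mean, and sum of squared deviations (M2).

    Hypothetical helper for illustration; not part of PySuricata.
    """

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's update: move the mean first, then grow M2 using the
        # deviation from the old mean times the deviation from the new mean.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningMoments") -> "RunningMoments":
        # Pairwise combination of two independent accumulators.
        if self.n == 0:
            return RunningMoments(other.n, other.mean, other.m2)
        if other.n == 0:
            return RunningMoments(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningMoments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 denominator); undefined below two values.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def moments_of(values) -> RunningMoments:
    """Fold an iterable of numbers into one accumulator."""
    acc = RunningMoments()
    for v in values:
        acc.update(v)
    return acc


# Two chunks processed independently, then merged.
chunk_a = [2.0, 4.0, 4.0, 4.0]
chunk_b = [5.0, 5.0, 7.0, 9.0]
merged = moments_of(chunk_a).merge(moments_of(chunk_b))
single_pass = moments_of(chunk_a + chunk_b)
print(merged.n, merged.mean, merged.variance)
```

Merging the two per-chunk accumulators should agree with a single pass over the concatenated data up to floating-point rounding, which is what lets chunks be processed separately.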

Installation

# with uv
uv add pysuricata

# or with pip
pip install pysuricata

This installs PySuricata along with its dependencies: pandas, numpy (on Python ≥3.13), markdown, and psutil.

To also install polars support:

pip install "pysuricata[polars]"

Quick Example

import pandas as pd
from pysuricata import profile

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Generate report
report = profile(df)
report.save_html("titanic_report.html")

This is the actual report generated from the code above (Titanic dataset, 891 rows × 12 columns):


How It Works

PySuricata reads data in chunks and updates lightweight accumulators for each column. This means:

| Aspect | Approach |
|---|---|
| Memory | Bounded by chunk size + accumulator state, not dataset size |
| Speed | Single pass over the data; each row is read once |
| Accuracy | Exact for moments (mean, variance, skewness, kurtosis); approximate with known error bounds for distinct counts and top-k |
| Mergeability | Accumulators can be merged across chunks or machines |
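
As an illustration of the memory and accuracy rows, the following sketch runs a Misra-Gries heavy-hitters summary over a stream in fixed-size chunks. It is a generic textbook version under assumed names (`iter_chunks`, `misra_gries_update`) and an assumed chunk size, not PySuricata's actual implementation.

```python
from collections import Counter
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield lists of at most chunk_size items; only one chunk is held at a time."""
    it = iter(iterable)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def misra_gries_update(counters, chunk, k):
    """Fold one chunk into a Misra-Gries summary with at most k - 1 counters."""
    for item in chunk:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Summary is full: decrement every counter, drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream: two frequent items plus a tail of rare ones.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 2 + ["a"] * 4
k = 4
summary = {}
rows_seen = 0
for chunk in iter_chunks(stream, chunk_size=16):
    misra_gries_update(summary, chunk, k)
    rows_seen += len(chunk)

exact = Counter(stream)  # kept only to compare against the summary
for item, estimate in summary.items():
    # Each estimate undercounts the true frequency by at most rows_seen / k.
    print(item, estimate, exact[item])
```

The summary never holds more than `k - 1` counters, and any item occurring more than `rows_seen / k` times is guaranteed to survive in it. That guarantee is the kind of "known error bound" the table refers to.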

Reports include per-column statistics, histograms, correlation chips, missing value analysis, outlier detection, and more — all computed during the single streaming pass.
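
For the approximate distinct counts, a K-Minimum-Values (KMV) sketch can be outlined as follows. This is a minimal sketch assuming a BLAKE2b-based hash and k = 256; the class name `KMVSketch`, the helper `hash01`, and these parameter choices are hypothetical and may differ from what PySuricata uses.

```python
import hashlib
import heapq


def hash01(value) -> float:
    """Map a value to a pseudo-uniform float in [0, 1) via a 64-bit hash."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class KMVSketch:
    """Keep the k smallest distinct hash values seen so far (illustrative only)."""

    def __init__(self, k: int = 256):
        self.k = k
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # same hashes, for duplicate detection

    def add(self, value) -> None:
        h = hash01(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen, so the count is exact.
            return float(len(self._heap))
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=256)
for i in range(100_000):
    sketch.add(i % 5_000)  # 5,000 distinct values, each seen 20 times
print(round(sketch.estimate()))
```

Memory stays at `k` hashes regardless of stream length, and the relative error shrinks roughly as 1/sqrt(k), so a larger `k` trades memory for accuracy.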

License

MIT License. See LICENSE for details.