Basic Usage
Generating an HTML Report
The simplest way to use PySuricata is to generate an HTML report from a DataFrame:
import pandas as pd
from pysuricata import profile
df = pd.read_csv("data.csv")
report = profile(df)
report.save_html("report.html")
Open report.html in any browser. The file is self-contained, so no external assets are needed.
Using Polars
PySuricata works natively with polars DataFrames and LazyFrames. Install polars support with:
Then use it the same way:
import polars as pl
from pysuricata import profile
# Eager DataFrame
df = pl.read_csv("data.csv")
report = profile(df)
report.save_html("report.html")
# LazyFrame: PySuricata collects it in chunks internally
lf = pl.scan_csv("large_file.csv")
report = profile(lf)
report.save_html("report.html")
Streaming Large Datasets
For datasets that don't fit in memory, pass a generator yielding DataFrame chunks:
Each chunk is processed and discarded, so memory stays bounded regardless of total dataset size.
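A minimal sketch of such a generator, assuming the data lives in a CSV file and that profile() accepts the generator directly, as described above. The helper names iter_csv_chunks and profile_csv_in_chunks are hypothetical and not part of PySuricata; the chunking itself relies on the chunksize parameter of pandas' read_csv.

```python
import pandas as pd


def iter_csv_chunks(path, chunksize=100_000):
    """Yield successive DataFrame chunks from a CSV file (hypothetical helper)."""
    # With chunksize set, pandas returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk


def profile_csv_in_chunks(path, out_path="report.html", chunksize=100_000):
    """Profile a CSV chunk by chunk and save the HTML report (sketch)."""
    # Imported here so the generator above can be used on its own.
    from pysuricata import profile

    report = profile(iter_csv_chunks(path, chunksize))
    report.save_html(out_path)
```

A smaller chunksize lowers peak memory at the cost of more per-chunk overhead, so the right value depends on row width and available RAM.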
Getting Statistics Without HTML
Use summarize() to get a dictionary of statistics without generating an HTML report:
from pysuricata import summarize
stats = summarize(df)
# Dataset-level statistics
print(stats["dataset"])
# {'rows_est': 891, 'cols': 12, 'missing_cells_pct': 8.7, ...}
# Per-column statistics
print(stats["columns"]["age"])
# {'mean': 29.7, 'std': 14.5, 'min': 0.42, 'max': 80.0, ...}
This is useful for CI/CD quality checks:
stats = summarize(df)
assert stats["dataset"]["missing_cells_pct"] < 5.0
assert stats["dataset"]["duplicate_rows_pct_est"] < 1.0
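Bare assertions stop at the first failure. If a pipeline should report every violated threshold at once, the checks can be collected instead. The sketch below is one way to do that; find_quality_violations and THRESHOLDS are hypothetical names, and the sample dictionary is illustrative rather than real output.

```python
def find_quality_violations(dataset_stats, max_pct):
    """Return one message per metric that is missing or not below its limit."""
    violations = []
    for key, limit in max_pct.items():
        value = dataset_stats.get(key)
        if value is None:
            violations.append(f"{key}: metric missing from stats")
        elif value >= limit:
            violations.append(f"{key}: {value} is not below {limit}")
    return violations


# Limits mirror the two assertions above.
THRESHOLDS = {"missing_cells_pct": 5.0, "duplicate_rows_pct_est": 1.0}

# Illustrative input; in practice pass stats["dataset"] from summarize().
sample = {"missing_cells_pct": 8.7, "duplicate_rows_pct_est": 0.0}
print(find_quality_violations(sample, THRESHOLDS))
# ['missing_cells_pct: 8.7 is not below 5.0']
```

In a CI job, a non-empty result can be printed and turned into a non-zero exit code.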
Saving Stats as JSON
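Because summarize() returns a dictionary, it can be written to disk with Python's standard json module. The following is a minimal sketch, not a built-in PySuricata feature: save_stats_json is a hypothetical helper, and default=str is an assumption that stringifying any value json cannot encode natively (for example timestamps) is acceptable.

```python
import json


def save_stats_json(stats, path):
    """Write a statistics dictionary to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        # default=str is a fallback for values json cannot encode itself.
        json.dump(stats, fh, indent=2, default=str)
```

Typical usage would be save_stats_json(summarize(df), "stats.json"), which makes it easy to archive statistics alongside a build or compare them between runs.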
Reproducible Reports
Set random_seed to make histogram sampling deterministic across runs:
from pysuricata import profile, ReportConfig
config = ReportConfig()
config.compute.random_seed = 42
report = profile(df, config=config)
# Same report every time with the same data
End-to-End Example
A complete example covering all four column types:
import pandas as pd
from pysuricata import profile, ReportConfig
df = pd.DataFrame({
"amount": [1.0, 2.5, None, 4.0, 5.5],
"country": ["US", "US", "DE", None, "FR"],
"ts": pd.to_datetime(["2021-01-01", "2021-01-02", None, "2021-01-04", "2021-01-05"]),
"flag": [True, False, True, None, False],
})
config = ReportConfig()
config.compute.random_seed = 0
report = profile(df, config=config)
report.save_html("report.html")
This generates a report with:
- amount analyzed as numeric (mean, std, histogram, outliers)
- country analyzed as categorical (top values, distinct count, entropy)
- ts analyzed as datetime (range, day-of-week distribution)
- flag analyzed as boolean (true/false ratio, balance score)
See Also
- Configuration Guide — All available options
- Advanced Features — Streaming from multiple sources, distributed processing
- Examples Gallery — More real-world use cases