
Frequently Asked Questions

General Questions

What is PySuricata?

PySuricata is a lightweight Python library for exploratory data analysis (EDA) that generates self-contained HTML reports using streaming algorithms for memory efficiency.

When should I use PySuricata vs pandas-profiling?

Use PySuricata when:

- Dataset > 1 GB (memory-constrained)
- Need streaming/bounded memory
- Want minimal dependencies
- Need reproducible reports (seeded sampling)
- Working with polars

Use pandas-profiling when:

- Dataset < 100 MB
- Need interactive widgets
- Want correlation heatmaps
- Don't mind heavy dependencies

Is PySuricata production-ready?

Yes! PySuricata is actively maintained with:

- 90%+ test coverage
- CI/CD pipeline
- Semantic versioning
- Regular releases on PyPI

Installation & Setup

How do I install PySuricata?

pip install pysuricata

What are the dependencies?

Core dependencies:

- pandas (or polars)
- markdown
- Python 3.9+

Optional:

- polars (for polars DataFrames)

Why is my installation failing?

Common issues:

1. Python version: Requires 3.9+

python --version  # Check version

2. Conflicting packages: Try a fresh virtual environment

python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install pysuricata

Usage Questions

How do I generate a report?

from pysuricata import profile

report = profile(df)
report.save_html("report.html")

Can I profile only specific columns?

Yes:

from pysuricata import profile, ReportConfig

config = ReportConfig()
config.compute.columns = ["col1", "col2", "col3"]

report = profile(df, config=config)

How do I make reports reproducible?

Set random seed:

config = ReportConfig()
config.compute.random_seed = 42

report = profile(df, config=config)

Can I get statistics without HTML?

Yes, use summarize():

from pysuricata import summarize

stats = summarize(df)
print(stats["dataset"])
print(stats["columns"]["my_column"])

Performance Questions

How much memory does PySuricata use?

Approximately:

- Base overhead: ~50 MB
- Per numeric column: ~160 KB (default sample_size=20K)
- Per categorical column: ~100 KB
- Independent of dataset size (streaming)
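
Using those figures, you can estimate peak memory from your column counts before profiling. This back-of-envelope helper is illustrative only (not a PySuricata API), and the per-column constants are the approximations listed above:

import pandas as pd

def estimate_memory_mb(df: pd.DataFrame) -> float:
    """Rough peak-memory estimate using the figures above (illustrative only)."""
    numeric = df.select_dtypes(include="number").shape[1]
    other = df.shape[1] - numeric
    return 50 + numeric * 160 / 1024 + other * 100 / 1024  # MB

# e.g. 20 numeric + 30 categorical columns -> ~56 MB, regardless of row count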

My report is slow. How can I speed it up?

Quick wins:

1. Disable correlations:

config.compute.compute_correlations = False

2. Reduce sample sizes:

config.compute.numeric_sample_size = 10_000  # Default: 20_000

3. Increase chunk size:

config.compute.chunk_size = 500_000  # Default: 200_000

Can PySuricata handle 1 TB datasets?

Yes, with streaming:

def read_large_dataset():
    for file in large_files:
        yield pd.read_parquet(file)

report = profile(read_large_dataset())

Memory usage stays constant regardless of dataset size.

Why are correlations slow?

Correlation computation is O(p²) where p = number of numeric columns.

Solutions:

- Disable correlations for > 100 columns (see the sketch below)
- Increase threshold to show fewer correlations
- Profile fewer columns
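
Because the pair count grows as p(p-1)/2, 100 numeric columns already mean 4,950 correlations. A simple guard using the config flag shown above:

from pysuricata import profile, ReportConfig

numeric_cols = df.select_dtypes(include="number").columns

config = ReportConfig()
if len(numeric_cols) > 100:
    # 100 numeric columns -> 100 * 99 / 2 = 4,950 correlation pairs
    config.compute.compute_correlations = False

report = profile(df, config=config)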

Technical Questions

Are the statistics exact or approximate?

Exact:

- Mean, variance, skewness, kurtosis (Welford/Pébay)
- Min, max, count
- Quantiles (if dataset fits in sample)

Approximate:

- Distinct count (KMV sketch, ~2% error with k=2048)
- Top-k values (Misra-Gries, guaranteed for freq > n/k)
- Quantiles for huge datasets (from reservoir sample)

How accurate are the approximations?

Distinct count (KMV):

- k=1024: ~3% error
- k=2048: ~2% error (default)
- k=4096: ~1.5% error

Top-k (Misra-Gries):

- Guaranteed to find all items with frequency > n/k
- Frequency estimates within ±n/k
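
For intuition about where the KMV bounds come from, here is a minimal standalone estimator (an illustrative sketch, not PySuricata's implementation): the estimate is (k - 1) divided by the k-th smallest normalized hash, with relative error on the order of 1/sqrt(k).

import hashlib

def kmv_estimate(values, k=2048):
    """Minimal K-Minimum Values estimator (illustrative, not PySuricata's code)."""
    # Hash every value into [0, 1). A real sketch keeps only the k smallest
    # hashes in a bounded heap, which is what makes it streaming-friendly.
    hashes = set()
    for v in values:
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big") / 2**64)
    k_smallest = sorted(hashes)[:k]
    if len(k_smallest) < k:
        return len(k_smallest)            # fewer than k distinct hashes: exact
    return int((k - 1) / k_smallest[-1])  # estimate from the k-th smallest hash

print(kmv_estimate(range(100_000)))  # ~100,000, typically within ~2%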

What algorithms does PySuricata use?

  • Moments: Welford's online algorithm, Pébay's parallel merge
  • Distinct count: K-Minimum Values (KMV) sketch
  • Top-k: Misra-Gries algorithm
  • Quantiles: Reservoir sampling (exact uniform sample)
  • Correlations: Streaming Pearson correlation

See Algorithms for details.
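
For a flavor of the streaming approach, here is a minimal single-pass Welford update for mean and variance. This is a sketch of the technique, not PySuricata's actual code; Pébay's formulas extend it so two partial results can be merged.

def welford(stream):
    """Single-pass mean and sample variance (Welford's algorithm, sketch)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # second factor uses the updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

print(welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # (8, 5.0, ~4.57)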

Does PySuricata support distributed computing?

Partially:

- Accumulators are mergeable: Can run on multiple machines and merge results
- No built-in distribution: Must use an external framework (Spark, Dask)

Example with manual merge:

# Machine 1
acc1 = NumericAccumulator("col")
acc1.update(data_part1)

# Machine 2
acc2 = NumericAccumulator("col")
acc2.update(data_part2)

# Merge
acc1.merge(acc2)
final_stats = acc1.finalize()

Data Questions

Does PySuricata modify my data?

No. PySuricata only reads data, never modifies it.

What data formats are supported?

Anything that can be loaded into pandas or polars:

- CSV, Parquet, JSON, Excel
- SQL databases (via pandas read_sql)
- APIs (via pandas read_json)

Just load into DataFrame first:

df = pd.read_csv("data.csv")
report = profile(df)

Can I profile streaming data (Kafka, etc.)?

Yes, if you can iterate through chunks:

def read_from_kafka():
    # Assumes kafka-python's KafkaConsumer and a user-defined parse()
    consumer = KafkaConsumer(...)
    chunk = []
    for message in consumer:
        chunk.append(parse(message))
        if len(chunk) >= 10_000:
            yield pd.DataFrame(chunk)
            chunk = []
    if chunk:
        yield pd.DataFrame(chunk)  # flush the final partial chunk

report = profile(read_from_kafka())

How does PySuricata handle missing values?

  • Excluded from calculations: Missing values don't affect mean, variance, etc.
  • Reported separately: Missing count and percentage shown
  • Visualization: Missing data distribution per chunk

See Missing Values.
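
This matches pandas' own null semantics, which you can verify directly:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(s.mean())        # 2.333... -- the NaN is excluded from the mean
print(s.isna().sum())  # 1 -- missing values are counted separately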

What about duplicate rows?

PySuricata estimates the duplicate percentage with a KMV sketch, so the figure is approximate. For an exact count:

exact_duplicates = df.duplicated().sum()
dup_pct = (exact_duplicates / len(df)) * 100

Report Questions

Why is my HTML report so large?

Possible reasons:

1. Many columns: Each variable card adds HTML
2. Large sample: Reduce sample_rows
3. Many top values: Reduce top_k_size

Typical sizes:

- 10 columns: ~500 KB
- 50 columns: ~2 MB
- 100 columns: ~4 MB
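
The relevant knobs live on ReportConfig. The attribute paths below are assumptions for illustration only; check your version's ReportConfig for the real names:

from pysuricata import profile, ReportConfig

config = ReportConfig()
config.compute.sample_rows = 5   # hypothetical name: smaller preview sample
config.compute.top_k_size = 10   # hypothetical name: fewer top values per card
report = profile(df, config=config)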

Can I customize the report appearance?

Not directly. The report uses inline CSS for portability.

Workaround: Modify the template in pysuricata/templates/report_template.html (advanced).

Can I export to PDF?

Not built-in. Options:

1. Print the HTML to PDF from your browser
2. Use a tool like wkhtmltopdf:

wkhtmltopdf report.html report.pdf
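
If you'd rather stay in Python, WeasyPrint (a separate install, unrelated to PySuricata) can convert the saved HTML, though complex CSS may render differently:

# pip install weasyprint  (third-party tool, not a PySuricata dependency)
from weasyprint import HTML

HTML("report.html").write_pdf("report.pdf")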

How do I display reports in Jupyter?

report = profile(df)
report  # Auto-displays

# Or with custom size
report.display_in_notebook(height="800px")

Error Messages

"Out of memory" error

Reduce memory usage:

config = ReportConfig()
config.compute.chunk_size = 50_000  # Smaller chunks
config.compute.numeric_sample_size = 5_000  # Smaller samples
config.compute.compute_correlations = False  # Skip correlations

"Cannot infer type" error

Some columns may have mixed types. Clean data first:

# Convert to consistent type
df["mixed_col"] = df["mixed_col"].astype(str)

# Or drop problematic columns
df = df.drop(columns=["problematic_col"])

"Build failed" (documentation)

If contributing and docs build fails:

# Check mkdocs.yml syntax
uv run python -c "import yaml; yaml.safe_load(open('mkdocs.yml'))"

# Verify all files exist
ls docs/  # Check file names match mkdocs.yml

# Build with verbose output
uv run mkdocs build --verbose

Contributing Questions

How can I contribute?

See Contributing Guide.

Ways to contribute:

- Report bugs
- Suggest features
- Improve documentation
- Submit pull requests
- Help others in Discussions

Where do I report bugs?

GitHub Issues

Include:

- Python version
- PySuricata version
- Minimal reproducible example
- Error message/traceback

Where can I get help?

Ask in GitHub Discussions or open an issue.


Last updated: 2025-10-12