Categorical Variable Analysis
This page documents how PySuricata analyzes categorical (string, object) variables using scalable streaming algorithms with mathematical guarantees.
Overview
PySuricata treats a categorical variable as any column with string-like values, objects, or low-cardinality integers. Analysis focuses on frequency distributions, diversity metrics, and string characteristics.
Key Features
- Top-k values with frequencies (Misra-Gries algorithm)
- Distinct count estimation (KMV sketch)
- Diversity metrics (entropy, Gini impurity, concentration)
- String statistics (length distribution, empty strings)
- Variant detection (case-insensitive, trimmed)
- Memory-efficient streaming algorithms
Summary Statistics Provided
For each categorical column:
- Count: non-null values, missing percentage
- Distinct: unique value count (exact or approximate)
- Top values: most frequent values with counts and percentages
- Entropy: Shannon entropy (information content)
- Gini impurity: concentration measure
- Diversity ratio: uniqueness measure
- Most common ratio: dominance of top value
- String length: mean, p90, distribution
- Special values: empty strings, whitespace-only
- Variants: case-insensitive and trimmed unique counts
Mathematical Definitions
Frequency and Probability
Let \(x_1, x_2, \ldots, x_n\) be the non-missing categorical values.
Frequency of value \(v\):

\[ f(v) = |\{i : x_i = v\}| \]

Relative frequency (empirical probability):

\[ p(v) = \frac{f(v)}{n} \]

Distinct count:

\[ d = |\{v : f(v) > 0\}| \]
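For intuition, all three quantities can be computed directly with pandas on in-memory data (a minimal sketch; PySuricata computes them incrementally over chunks rather than on a full Series):

```python
import pandas as pd

values = pd.Series(["USA", "UK", "USA", "DE", "USA", "FR"])

f = values.value_counts()   # frequency f(v)
p = f / len(values)         # relative frequency p(v)
d = values.nunique()        # distinct count d

print(f["USA"], p["USA"], d)  # 3 0.5 4
```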
Shannon Entropy
Measures the information content or uncertainty in the distribution:

\[ H(X) = -\sum_{v} p(v) \log_2 p(v) \]

where the sum is over all distinct values \(v\) with \(p(v) > 0\).
Properties:

- \(H(X) = 0\) if one value has \(p = 1\) (no uncertainty)
- \(H(X) = \log_2 d\) if all values are equally likely (maximum entropy)
- Units: bits of information

Interpretation:

- Low entropy (< 1 bit): highly concentrated, predictable
- Medium entropy (1-3 bits): moderate diversity
- High entropy (> 3 bits): high diversity, hard to predict
Example:

- Uniform distribution over 8 values: \(H = \log_2 8 = 3\) bits
- One value at 90%, the remaining 10% split evenly over 9 values: \(H \approx 0.79\) bits
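A quick check of both examples, using a small helper (standalone sketch, not part of the PySuricata API):

```python
import numpy as np

def shannon_entropy(p: np.ndarray) -> float:
    """Shannon entropy in bits, skipping zero-probability values."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(shannon_entropy(np.full(8, 1 / 8)))                # 3.0 bits (uniform over 8)
print(shannon_entropy(np.array([0.9] + [0.1 / 9] * 9)))  # ~0.79 bits (90% dominant)
```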
Gini Impurity
Measures the probability of misclassification if labels were assigned randomly according to the distribution:

\[ \text{Gini}(X) = 1 - \sum_{v} p(v)^2 \]
Properties:

- \(\text{Gini}(X) = 0\) if one value (no impurity)
- \(\text{Gini}(X) = 1 - 1/d\) if uniform over \(d\) values
- Range: \([0, 1)\)

Interpretation:

- Low Gini (< 0.2): concentrated distribution
- Medium Gini (0.2-0.6): moderate spread
- High Gini (> 0.6): high diversity
Use in ML: Decision trees use Gini impurity for splitting criteria.
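Gini impurity uses the same frequency vector (a sketch matching the formula and properties above):

```python
import numpy as np

def gini_impurity(p: np.ndarray) -> float:
    """Gini impurity: 1 - sum of squared probabilities."""
    return float(1.0 - np.sum(p ** 2))

print(gini_impurity(np.array([1.0])))   # 0.0: a single value, no impurity
print(gini_impurity(np.full(4, 0.25)))  # 0.75 = 1 - 1/4: uniform over d = 4
```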
Diversity Ratio
Simple measure of uniqueness:

\[ D = \frac{d}{n} \]
where \(d\) = distinct count, \(n\) = total count.
Interpretation:

- \(D \to 0\): low diversity (many repeats)
- \(D \to 1\): high diversity (mostly unique)

Special cases:

- \(D = 1\): all values unique (e.g., primary keys)
- \(D = 1/n\): all values identical
Concentration Ratio
Fraction of observations in the top \(k\) values:

\[ CR_k = \sum_{i=1}^{k} p(v_i) \]
where \(v_1, v_2, \ldots\) are values sorted by frequency (descending).
Example: \(CR_5 = 0.80\) means top 5 values account for 80% of data.
Interpretation:

- High \(CR_k\): distribution dominated by few values
- Low \(CR_k\): distribution spread across many values
Most Common Ratio
Dominance of the single most frequent value:

\[ \text{MCR} = p(v_1) = \max_{v} p(v) \]

Interpretation:

- MCR > 0.9: highly dominant category (nearly constant)
- MCR < 0.1: no dominant category (high diversity)
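Both ratios fall out of a descending-sorted frequency table (a sketch on in-memory data):

```python
import pandas as pd

values = pd.Series(["a", "a", "a", "b", "b", "c", "d", "e"])
p = values.value_counts(normalize=True)  # sorted by frequency, descending

cr_2 = p.iloc[:2].sum()  # concentration ratio CR_2
mcr = p.iloc[0]          # most common ratio

print(f"CR_2 = {cr_2:.3f}, MCR = {mcr:.3f}")  # CR_2 = 0.625, MCR = 0.375
```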
Streaming Algorithms
Misra-Gries Algorithm for Top-K
Finds the \(k\) most frequent items in a stream using O(k) space with frequency guarantees.
Algorithm:
- Initialize: empty dictionary \(M\) (max size \(k\))
- For each value \(v\):
    - If \(v \in M\): increment \(M[v]\)
    - Else if \(|M| < k\): add \(M[v] = 1\)
    - Else: decrement all counts in \(M\); remove zeros
- Output: items in \(M\) with estimated counts
Guarantee: for any value \(v\) with true frequency \(f(v)\):

- If \(f(v) > n/k\), then \(v\) is in the output
- The estimated frequency is within \(n/k\) of the true frequency
Space complexity: \(O(k)\)
Update complexity: \(O(k)\) worst-case, \(O(1)\) amortized
Mergeable: Yes (sum counters from multiple streams)
Example: \(k = 50\), \(n = 1{,}000{,}000\)

- Guaranteed to find all items with frequency > 20,000
- Frequency estimates within ±20,000
Reference: Misra, J., Gries, D. (1982), "Finding repeated elements", Science of Computer Programming, 2(2): 143–152.
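A compact Python version of the update rule above (a didactic sketch; PySuricata's internal MisraGries class may differ in details):

```python
class MisraGries:
    """Misra-Gries heavy-hitters summary with at most k counters."""

    def __init__(self, k: int):
        self.k = k
        self.counters: dict[str, int] = {}

    def update(self, v: str) -> None:
        if v in self.counters:
            self.counters[v] += 1
        elif len(self.counters) < self.k:
            self.counters[v] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def top(self) -> list[tuple[str, int]]:
        return sorted(self.counters.items(), key=lambda kv: -kv[1])

mg = MisraGries(k=2)
for v in ["a", "a", "b", "a", "c", "a", "b"]:
    mg.update(v)
print(mg.top())  # "a" is guaranteed to appear: f("a") = 4 > n/k = 3.5
```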
KMV Sketch for Distinct Count
Estimates cardinality using \(k\) minimum hash values.
Algorithm:
- Initialize: empty set \(S\) (max size \(k\))
- For each value \(v\):
    - Compute hash \(h(v) \in [0, 1]\)
    - If \(|S| < k\) or \(h(v) < \max(S)\):
        - Add \(h(v)\) to \(S\)
        - If \(|S| > k\): remove \(\max(S)\)
- Estimate:

\[ \hat{d} = \frac{k - 1}{x_k} \]

where \(x_k = \max(S)\) is the \(k\)-th smallest hash.

Error bound (relative standard error):

\[ \frac{\sigma(\hat{d})}{d} \approx \frac{1}{\sqrt{k}} \]
Space: \(O(k)\)
Update: \(O(\log k)\) (heap operations)
Mergeable: Yes (union of sets)
Example: \(k = 2048\)

- Relative standard error: \(1/\sqrt{2048} \approx 2.2\%\)
- Space: ~16 KB (assuming 64-bit hashes)
Reference: Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM.
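A minimal KMV sketch, hashing values into [0, 1] with a 64-bit hash (didactic sketch; a production version would use a heap for O(log k) updates, and PySuricata's KMV class may differ):

```python
import hashlib

class KMV:
    """K-minimum-values sketch for approximate distinct counts."""

    def __init__(self, k: int):
        self.k = k
        self._mins: set[float] = set()  # the k smallest hash values seen

    @staticmethod
    def _hash01(v: str) -> float:
        h = int.from_bytes(hashlib.blake2b(v.encode(), digest_size=8).digest(), "big")
        return h / 2**64  # map 64-bit hash to [0, 1)

    def update(self, v: str) -> None:
        x = self._hash01(v)
        if x in self._mins:
            return                     # duplicate value, already tracked
        if len(self._mins) < self.k:
            self._mins.add(x)
        elif x < max(self._mins):      # O(k) here; a heap makes this O(log k)
            self._mins.remove(max(self._mins))
            self._mins.add(x)

    def estimate(self) -> float:
        if len(self._mins) < self.k:
            return float(len(self._mins))    # fewer than k distinct: exact
        return (self.k - 1) / max(self._mins)

kmv = KMV(k=1024)
for i in range(50_000):
    kmv.update(f"user_{i % 10_000}")
print(round(kmv.estimate()))  # close to 10,000 (std error ~1/sqrt(1024) ≈ 3%)
```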
Space-Saving Algorithm (Alternative)
Maintains the top-\(k\) items with guaranteed error bounds.
Error bound: Estimated frequency within \(\epsilon n\) where \(\epsilon = 1/k\)
Advantage over Misra-Gries: Tighter worst-case bounds, better for skewed distributions.
Reference: Metwally, A., Agrawal, D., El Abbadi, A. (2005), "Efficient Computation of Frequent and Top-k Elements in Data Streams", ICDT.
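For comparison, the Space-Saving update rule fits in a few lines (a sketch only; per the configuration section below, PySuricata's top-k uses Misra-Gries):

```python
class SpaceSaving:
    """Space-Saving top-k: counts overestimate by at most n/k."""

    def __init__(self, k: int):
        self.k = k
        self.counts: dict[str, int] = {}

    def update(self, v: str) -> None:
        if v in self.counts:
            self.counts[v] += 1
        elif len(self.counts) < self.k:
            self.counts[v] = 1
        else:
            # Evict the minimum-count item; the newcomer inherits its
            # count + 1, which is where the bounded overestimate comes from.
            victim = min(self.counts, key=self.counts.get)
            self.counts[v] = self.counts.pop(victim) + 1
```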
String Analysis
Length Statistics
For string values, track:
Mean length:

\[ \bar{\ell} = \frac{1}{n} \sum_{i=1}^{n} |x_i| \]
where \(|x_i|\) is the character count of string \(x_i\).
P90 length: 90th percentile of lengths (via reservoir sampling)
Length distribution: Histogram of string lengths
Use cases:

- Detect outliers (abnormally long strings)
- Validate constraints (max length)
- Estimate storage requirements
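On in-memory data the same statistics are one-liners (PySuricata instead streams the lengths and takes p90 from a reservoir sample):

```python
import pandas as pd

s = pd.Series(["short", "medium-length", "a" * 500, "ok"])
lengths = s.str.len()

print(lengths.mean())          # mean length
print(lengths.quantile(0.9))   # p90 length
print(lengths.max())           # catches the 500-character outlier
```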
Empty Strings
Count of strings that are:

- Empty: ""
- Whitespace-only: match /^\s*$/

These are tracked separately from true missing values (NULL), since missing data is often encoded as empty strings.

Formula:

\[ \text{empty\_ratio} = \frac{|\{i : \text{trim}(x_i) = \text{""}\}|}{n} \]
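A regex over the non-null values reproduces the count (sketch):

```python
import pandas as pd

s = pd.Series(["USA", "", "   ", "\t", None, "UK"])
non_null = s.dropna()

empty_like = non_null.str.fullmatch(r"\s*")  # empty or whitespace-only
print(int(empty_like.sum()), len(non_null))  # 3 5 -> empty_ratio = 0.6
```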
Case Variants
Estimate distinct count after case normalization:

\[ d_{\text{lower}} = |\{\text{lower}(v) : f(v) > 0\}| \]

Interpretation:

- \(d_{\text{lower}} < d\): case variants present (e.g., "USA", "usa")
- \(d_{\text{lower}} = d\): no case variants
Trim Variants
Estimate distinct count after removing leading/trailing whitespace:

\[ d_{\text{trim}} = |\{\text{trim}(v) : f(v) > 0\}| \]

Interpretation:

- \(d_{\text{trim}} < d\): whitespace variants present
- \(d_{\text{trim}} = d\): no trim variants
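Exact in-memory equivalents of both checks (PySuricata estimates them with the two extra KMV sketches instead):

```python
import pandas as pd

s = pd.Series(["USA", "usa", "Usa", " UK", "UK", "UK "])

d = s.nunique()                    # 6 distinct raw values
d_lower = s.str.lower().nunique()  # 4: case variants collapse
d_trim = s.str.strip().nunique()   # 4: trim variants collapse

print(d, d_lower, d_trim)  # 6 4 4 -> both case and trim variants present
```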
Chi-Square Uniformity Test
Test if the distribution is uniform (all categories equally likely).
Null hypothesis: \(p(v_1) = p(v_2) = \cdots = p(v_d) = 1/d\)
Test statistic:

\[ \chi^2 = \sum_{v} \frac{(f(v) - E)^2}{E} \]
where \(E = n/d\) is the expected frequency under uniformity.
Distribution under \(H_0\): \(\chi^2_{d-1}\) (chi-square with \(d-1\) degrees of freedom)
P-value: \(P(\chi^2_{d-1} > \chi^2_{\text{obs}})\)
Interpretation:

- Small p-value (< 0.05): reject uniformity (distribution is skewed)
- Large p-value: consistent with a uniform distribution
Note: the chi-square test is not implemented in the current version; it is planned for a future release.
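Until then, the test is easy to run by hand with SciPy over the observed frequencies:

```python
import pandas as pd
from scipy import stats

values = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 20)
observed = values.value_counts().to_numpy()

# By default scipy assumes uniform expected frequencies E = n/d.
chi2, p_value = stats.chisquare(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.4f}")  # small p -> reject uniformity
```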
Cardinality Categories
Classify categorical variables by distinct count:
| Category | Distinct Count | Examples |
|---|---|---|
| Boolean-like | 2-3 | Yes/No, True/False/Unknown |
| Low cardinality | 4-20 | Status codes, categories |
| Medium cardinality | 21-100 | US states, countries |
| High cardinality | 101-10,000 | Zip codes, product IDs |
| Very high cardinality | > 10,000 | User IDs, URLs, emails |
Recommended actions (see the sketch below):

- Low cardinality: show all values in the report
- High cardinality: show top-k only, estimate distinct count
- Very high cardinality: consider as identifier (unique key)
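The thresholds in the table map to a straightforward classifier (a hypothetical helper, not a PySuricata function):

```python
def cardinality_category(d: int) -> str:
    """Bucket a distinct count using the thresholds in the table above."""
    if d <= 3:
        return "boolean-like"
    if d <= 20:
        return "low"
    if d <= 100:
        return "medium"
    if d <= 10_000:
        return "high"
    return "very high"

print(cardinality_category(2), cardinality_category(50), cardinality_category(50_000))
# boolean-like medium very high
```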
Computational Complexity
| Operation | Time | Space | Notes |
|---|---|---|---|
| Misra-Gries | \(O(nk)\) worst, \(O(n)\) amortized | \(O(k)\) | Top-k values |
| KMV distinct | \(O(n \log k)\) | \(O(k)\) | Distinct count |
| Entropy | \(O(n + d)\) | \(O(d)\) | From frequency table |
| String lengths | \(O(n)\) | \(O(k)\) | Reservoir sample |
| Exact distinct | \(O(n)\) | \(O(d)\) | Hash set |
Configuration
Control categorical analysis via ReportConfig:
```python
from pysuricata import profile, ReportConfig

config = ReportConfig()

# Top-k size (Misra-Gries)
config.compute.top_k_size = 50  # Default

# Distinct count sketch size (KMV)
config.compute.uniques_sketch_size = 2_048  # Default

# String length sample size
# (Not separately configurable, uses numeric_sample_size)

# Enable/disable case variants
# (Always enabled, no toggle)

# Enable/disable trim variants
# (Always enabled, no toggle)

report = profile(df, config=config)
```
Implementation Details
CategoricalAccumulator Class
```python
import pandas as pd

class CategoricalAccumulator:
    def __init__(self, name: str, config: CategoricalConfig):
        self.name = name
        self.count = 0
        self.missing = 0

        # Top-k values
        self._topk = MisraGries(config.top_k_size)

        # Distinct count
        self._uniques = KMV(config.uniques_sketch_size)
        self._uniques_lower = KMV(config.uniques_sketch_size)  # Case-insensitive
        self._uniques_strip = KMV(config.uniques_sketch_size)  # Trimmed

        # String lengths
        self._len_sum = 0
        self._len_n = 0
        self._len_sample = ReservoirSampler(config.length_sample_size)

        # Special values
        self._empty_count = 0

    def update(self, values: pd.Series) -> None:
        """Update with a chunk of values (sketch of the streaming pass;
        assumes each sketch exposes an update() method)."""
        mask = values.isna()
        self.missing += int(mask.sum())
        for v in values[~mask].astype(str):
            self.count += 1
            self._topk.update(v)                   # top-k frequencies
            self._uniques.update(v)                # distinct (original)
            self._uniques_lower.update(v.lower())  # distinct (case-folded)
            self._uniques_strip.update(v.strip())  # distinct (trimmed)
            self._len_sum += len(v)                # string length stats
            self._len_n += 1
            self._len_sample.update(len(v))
            if v.strip() == "":                    # empty / whitespace-only
                self._empty_count += 1

    def finalize(self) -> CategoricalSummary:
        """Compute final statistics."""
        # Get top values from Misra-Gries
        # Estimate distinct counts from the KMV sketches
        # Compute entropy and Gini from the frequency table
        # Compute string length statistics from the reservoir sample
        return CategoricalSummary(...)
```
Validation
PySuricata validates categorical algorithms:
- Exact vs approximate: Compare KMV estimate to exact count (small datasets)
- Top-k correctness: Verify all items with \(f > n/k\) are found
- Entropy bounds: Check \(0 \le H(X) \le \log_2 d\)
- Gini bounds: Check \(0 \le \text{Gini}(X) < 1\)
- Mergeability: verify that merging two accumulators gives the same result as processing the concatenated streams (for Misra-Gries, KMV)
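A typical exact-vs-approximate check, reusing the didactic KMV class from earlier (illustrative test code, not PySuricata's test suite):

```python
import pandas as pd

s = pd.Series([f"item_{i % 500}" for i in range(5_000)])

kmv = KMV(k=1024)  # the didactic KMV sketch defined above
for v in s:
    kmv.update(v)

exact = s.nunique()
approx = kmv.estimate()
assert abs(approx - exact) / exact < 0.10, (exact, approx)
print(exact, round(approx))  # 500 500 -- below k distinct, the sketch is exact
```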
Examples
Basic Usage
```python
import pandas as pd
from pysuricata import profile

df = pd.DataFrame({
    "country": ["USA", "UK", "USA", "DE", None, "USA", "FR"]
})

report = profile(df)
report.save_html("report.html")
```
High-Cardinality Column
```python
import pandas as pd
from pysuricata import profile, ReportConfig

# Column with 100,000 unique values
df = pd.DataFrame({
    "user_id": [f"user_{i}" for i in range(100_000)]
})

config = ReportConfig()
config.compute.top_k_size = 100  # Show top 100
config.compute.uniques_sketch_size = 4_096  # More accurate distinct count

report = profile(df, config=config)
```
Access Statistics
```python
from pysuricata import summarize

stats = summarize(df)
country_stats = stats["columns"]["country"]

print(f"Distinct: {country_stats['distinct']}")
print(f"Top value: {country_stats['top_values'][0]}")
print(f"Entropy: {country_stats['entropy']:.2f} bits")
print(f"Gini: {country_stats['gini']:.3f}")
```
Interpreting Results
High Entropy
\(H(X) > 5\) bits suggests:

- Many distinct values (> 32)
- Fairly uniform distribution
- High information content
- Possibly high cardinality (consider as identifier)
Low Entropy
\(H(X) < 1\) bit suggests:

- Few distinct values (< 4 effective)
- Skewed distribution (one value dominates)
- Low information content
- Consider as a low-cardinality categorical
High Gini
\(\text{Gini} > 0.7\) suggests:

- Values well-distributed
- No single dominant category
- Good for stratification
Low Gini
\(\text{Gini} < 0.2\) suggests:

- One or a few values dominate
- Imbalanced distribution
- Consider as a nearly constant column
Special Cases
All Unique (Primary Key)
- Distinct count \(d = n\)
- Entropy \(H = \log_2 n\) (maximum)
- Diversity ratio \(D = 1.0\)
- Top-k meaningless (all have count 1)
Recommendation: Flag as identifier, exclude from analysis.
Nearly Constant
- Distinct count \(d = 2\) with \(p_1 > 0.99\)
- Entropy \(H < 0.1\) bits
- Gini \(< 0.02\)
Recommendation: Consider removing (low variance).
Many Empty Strings
- Empty count > 10% of non-null
Possible data quality issue: Missing values encoded as empty strings.
References
- Misra, J., Gries, D. (1982), "Finding repeated elements", Science of Computer Programming, 2(2): 143–152.
- Bar-Yossef, Z. et al. (2002), "Counting Distinct Elements in a Data Stream", RANDOM.
- Metwally, A., Agrawal, D., El Abbadi, A. (2005), "Efficient Computation of Frequent and Top-k Elements in Data Streams", ICDT.
- Shannon, C.E. (1948), "A Mathematical Theory of Communication", Bell System Technical Journal, 27: 379–423.
- Breiman, L. et al. (1984), Classification and Regression Trees, Wadsworth.
- Wikipedia: Entropy (information theory)
- Wikipedia: Decision tree learning
See Also
- Numeric Analysis - Numeric variables
- Sketch Algorithms - KMV, Misra-Gries deep dive
- Data Quality - Quality metrics
- Configuration Guide - All parameters
Last updated: 2025-10-12