# Statistical Drift Detection
Baselinr provides advanced statistical drift detection using multiple tests that can be selected and combined based on column type and metric characteristics.
## Overview
The statistical drift detection strategy uses rigorous statistical methods to detect changes in data distributions, not just simple threshold-based comparisons. It automatically selects appropriate tests based on whether your data is numeric or categorical.
## When to Use Statistical Tests
Statistical tests are ideal when you need:
- Distribution-aware detection: Detect changes in data shape, not just mean shifts
- Categorical data analysis: Track changes in category distributions
- Reduced false positives: Statistical significance testing reduces noise
- Multiple perspectives: Combine multiple tests for comprehensive coverage
- Histogram data: Leverage histogram information when available
## Configuration

### Basic Configuration

```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests:
      - ks_test
      - psi
      - chi_square
    sensitivity: medium
```
### Full Configuration with Test Parameters

```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests:
      - ks_test      # Kolmogorov-Smirnov test
      - psi          # Population Stability Index
      - z_score      # Z-score test
      - chi_square   # Chi-square test
      - entropy      # Entropy change
      - top_k        # Top-K stability
    sensitivity: medium  # low, medium, or high
    test_params:
      ks_test:
        alpha: 0.05  # Significance level
      psi:
        buckets: 10  # Number of distribution buckets
        threshold: 0.2  # PSI threshold for drift
      z_score:
        z_threshold: 2.0  # Z-score threshold (std devs)
      chi_square:
        alpha: 0.05  # Significance level
      entropy:
        entropy_threshold: 0.1  # Entropy change threshold
      top_k:
        k: 10  # Number of top categories
        similarity_threshold: 0.7  # Similarity threshold
```
## Available Statistical Tests

### Numeric Column Tests

#### 1. Kolmogorov-Smirnov (KS) Test

Test Name: `ks_test`
Description: Compares the distribution of baseline vs current data. Good for detecting shape changes (skew, multimodality, heavy tails).
Parameters:
- `alpha`: Significance level (default: 0.05)
How it works:
- Compares empirical cumulative distribution functions (CDFs)
- Returns KS statistic (maximum difference between CDFs)
- Calculates p-value for statistical significance
- Works best with histogram data, but can approximate from summary statistics
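To make the CDF comparison concrete, here is a minimal sketch of a KS statistic computed from two histograms that share the same bins. The helper is hypothetical, not Baselinr's internal API:

```python
def ks_statistic_from_histograms(baseline_counts, current_counts):
    """Max absolute difference between the two empirical CDFs (the KS statistic)."""
    base_total, cur_total = sum(baseline_counts), sum(current_counts)
    base_cdf = cur_cdf = max_diff = 0.0
    for b, c in zip(baseline_counts, current_counts):
        base_cdf += b / base_total
        cur_cdf += c / cur_total
        max_diff = max(max_diff, abs(base_cdf - cur_cdf))
    return max_diff

# Two 5-bin histograms: the current data shifts mass toward the upper bins
print(ks_statistic_from_histograms([10, 20, 40, 20, 10], [5, 10, 30, 35, 20]))  # 0.25
```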
Example:

```yaml
test_params:
  ks_test:
    alpha: 0.05
```
Best for: Detecting distribution shape changes in numeric columns
#### 2. Population Stability Index (PSI)

Test Name: `psi`
Description: Bucket-based drift detection. Good for monitoring slow drifts over long periods.
Parameters:
- `buckets`: Number of buckets for the distribution (default: 10)
- `threshold`: PSI threshold for drift detection (default: 0.2)
PSI Score Interpretation:
- < 0.1: No significant drift
- 0.1-0.2: Minor drift
- 0.2-0.5: Moderate drift
- > 0.5: Significant drift
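For reference, here is a minimal sketch of the standard PSI formula over shared buckets (a hypothetical helper, not Baselinr's internal API; the epsilon guard for empty buckets is an assumption, not something the docs specify):

```python
import math

def psi_score(baseline_counts, current_counts, eps=1e-6):
    """PSI = sum((p_cur - p_base) * ln(p_cur / p_base)) over buckets."""
    base_total, cur_total = sum(baseline_counts), sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        p_base = max(b / base_total, eps)  # guard against empty buckets (assumption)
        p_cur = max(c / cur_total, eps)
        score += (p_cur - p_base) * math.log(p_cur / p_base)
    return score

print(psi_score([10, 20, 40, 20, 10], [5, 10, 30, 35, 20]))  # ≈ 0.29 → moderate drift
```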
Example:

```yaml
test_params:
  psi:
    buckets: 20
    threshold: 0.15
```
Best for: Long-term drift monitoring, especially with histogram data
#### 3. Z-Score / Variance Test

Test Name: `z_score`
Description: Detects shifts in mean/variance using standard deviation.
Parameters:
- `z_threshold`: Z-score threshold in standard deviations (default: 2.0)
How it works:
- Calculates `z = |(current_mean - baseline_mean) / baseline_stddev|`
- Flags drift if the z-score exceeds the threshold
- Severity is based on the z-score magnitude
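The calculation is simple enough to sketch directly (a hypothetical helper, not Baselinr's internal API):

```python
def mean_shift_z(baseline_mean, baseline_stddev, current_mean, z_threshold=2.0):
    """Return the z-score of the mean shift and whether it exceeds the threshold."""
    z = abs((current_mean - baseline_mean) / baseline_stddev)
    return z, z > z_threshold

# A 40-unit shift against a baseline stddev of 15 is a 2.67-sigma move
z, drifted = mean_shift_z(baseline_mean=100.0, baseline_stddev=15.0, current_mean=140.0)
print(f"z={z:.2f}, drift={drifted}")  # z=2.67, drift=True
```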
Example:

```yaml
test_params:
  z_score:
    z_threshold: 2.5  # Less sensitive (requires a 2.5 std dev shift)
```
Best for: Detecting mean shifts when you have stddev information
### Categorical Column Tests

#### 4. Chi-Square Test

Test Name: `chi_square`
Description: Tests whether the distribution of categories has changed significantly.
Parameters:
- `alpha`: Significance level (default: 0.05)
How it works:
- Compares observed vs expected category frequencies
- Calculates chi-square statistic
- Uses p-value for statistical significance
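A minimal sketch of this comparison using scipy (a hypothetical helper, not Baselinr's internal API). Baseline proportions are scaled to the current row count so observed and expected frequencies sum to the same total, which `scipy.stats.chisquare` requires:

```python
from scipy.stats import chisquare

def category_chi_square(baseline_counts, current_counts, alpha=0.05):
    """Chi-square test of current category frequencies against baseline proportions."""
    cur_total, base_total = sum(current_counts), sum(baseline_counts)
    expected = [b / base_total * cur_total for b in baseline_counts]
    stat, p_value = chisquare(f_obs=current_counts, f_exp=expected)
    return stat, p_value, p_value < alpha

# Three categories; the third grows sharply in the current window
print(category_chi_square([500, 300, 200], [450, 280, 370]))  # tiny p-value → drift
```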
Example:

```yaml
test_params:
  chi_square:
    alpha: 0.01  # More strict (1% significance)
```
Best for: Detecting changes in category distributions
#### 5. Entropy Change Test

Test Name: `entropy`
Description: Detects changes in Shannon entropy (randomness/uniformity) of category distributions.
Parameters:
- `entropy_threshold`: Threshold for entropy change (default: 0.1)
How it works:
- Calculates Shannon entropy: `H = -Σ(p * log2(p))`
- Compares baseline vs current entropy
- Flags drift if the entropy change exceeds the threshold
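A minimal sketch of the entropy comparison (hypothetical helpers, not Baselinr's internal API), operating on category frequency counts:

```python
import math

def shannon_entropy(counts):
    """H = -sum(p * log2(p)) over category proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_drift(baseline_counts, current_counts, entropy_threshold=0.1):
    change = abs(shannon_entropy(current_counts) - shannon_entropy(baseline_counts))
    return change, change > entropy_threshold

# The current distribution is far more concentrated (lower entropy)
print(entropy_drift([400, 300, 300], [800, 100, 100]))  # (≈0.65, True)
```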
Example:

```yaml
test_params:
  entropy:
    entropy_threshold: 0.15
```
Best for: Detecting changes in data uniformity/randomness
#### 6. Top-K Stability Test

Test Name: `top_k`
Description: Tracks the top-K most frequent categories and detects changes.
Parameters:
- `k`: Number of top categories to track (default: 10)
- `similarity_threshold`: Similarity threshold for stability (default: 0.7)
How it works:
- Extracts top-K categories from baseline and current
- Calculates Jaccard similarity (intersection / union)
- Flags drift if similarity drops below threshold
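A minimal sketch of the Jaccard comparison (a hypothetical helper, not Baselinr's internal API), taking category-to-frequency mappings:

```python
from collections import Counter

def top_k_similarity(baseline_freqs, current_freqs, k=10):
    """Jaccard similarity between the top-K category sets of two windows."""
    top_base = {cat for cat, _ in Counter(baseline_freqs).most_common(k)}
    top_cur = {cat for cat, _ in Counter(current_freqs).most_common(k)}
    return len(top_base & top_cur) / len(top_base | top_cur)

# "d" displaces "b" among the top-2 categories
sim = top_k_similarity({"a": 90, "b": 80, "c": 5}, {"a": 85, "d": 70, "b": 60}, k=2)
print(sim, sim < 0.7)  # 0.33..., True → drift
```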
Example:

```yaml
test_params:
  top_k:
    k: 20
    similarity_threshold: 0.8  # More strict
```
Best for: Monitoring stability of most common categories
## Sensitivity Levels

The `sensitivity` parameter adjusts thresholds across all tests:

- `low`: Less sensitive (higher thresholds) - reduces false positives
- `medium`: Balanced (default thresholds) - recommended starting point
- `high`: More sensitive (lower thresholds) - catches more drift, may have more false positives
How it works:
- Low sensitivity: thresholds × 1.5
- Medium sensitivity: thresholds × 1.0 (default)
- High sensitivity: thresholds × 0.5
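In other words, each test's effective threshold is its configured threshold times a sensitivity multiplier, roughly like this (a sketch of the rule above, not Baselinr's internal API):

```python
SENSITIVITY_MULTIPLIERS = {"low": 1.5, "medium": 1.0, "high": 0.5}

def effective_threshold(base_threshold, sensitivity="medium"):
    return base_threshold * SENSITIVITY_MULTIPLIERS[sensitivity]

print(effective_threshold(0.2, "low"))   # 0.3 -> harder to flag drift
print(effective_threshold(0.2, "high"))  # 0.1 -> easier to flag drift
```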
## Test Selection
The statistical strategy automatically selects applicable tests based on:
- Column Type: Numeric tests for numeric columns, categorical tests for categorical columns
- Metric Type: Tests check if they support the specific metric being compared
- Data Availability: Tests that can't run (insufficient data) are skipped gracefully
### Automatic Test Selection

```text
# Numeric column with mean metric
# → Runs: ks_test, psi, z_score (if data available)

# Categorical column with distinct_count metric
# → Runs: chi_square, entropy, top_k (if data available)
```
## Data Requirements

### Optimal Data
Statistical tests work best with:
- Histogram data: For KS test and PSI (enables distribution comparison)
- Category distributions: For categorical tests (top values, frequencies)
- Summary statistics: Mean, stddev, min, max (for approximations)
### Fallback Behavior
If optimal data isn't available:
- Tests use approximations from summary statistics
- Some tests may skip with a warning
- System falls back to threshold-based detection if no tests can run
### Enabling Histogram Data
To get the best results from statistical tests, enable histograms in your profiling config:
```yaml
profiling:
  compute_histograms: true
  histogram_bins: 10  # More bins = more granular distribution
```
## Usage Examples

### Example 1: Numeric Columns with Histograms
```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests:
      - ks_test
      - psi
      - z_score
    sensitivity: medium
    test_params:
      ks_test:
        alpha: 0.05
      psi:
        buckets: 15
        threshold: 0.2
```
What it detects:
- Distribution shape changes (KS test)
- Bucket-level shifts (PSI)
- Mean shifts (Z-score)
### Example 2: Categorical Columns
```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests:
      - chi_square
      - entropy
      - top_k
    sensitivity: high
    test_params:
      chi_square:
        alpha: 0.01
      top_k:
        k: 15
        similarity_threshold: 0.8
```
What it detects:
- Category distribution changes (Chi-square)
- Entropy/uniformity changes (Entropy)
- Top category stability (Top-K)
### Example 3: Comprehensive Coverage
```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests:
      - ks_test
      - psi
      - z_score
      - chi_square
      - entropy
      - top_k
    sensitivity: medium
```
What it detects: All types of drift for both numeric and categorical columns
## Understanding Results

### Test Result Aggregation
When multiple tests run, results are aggregated:
- Drift Detection: Any test detecting drift → overall drift detected
- Severity: Maximum severity across all tests
- Score: Average score across all tests
- Metadata: Detailed results from each test included
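A minimal sketch of these aggregation rules (a hypothetical helper, not Baselinr's internal API; the severity ordering is an assumption). Each test result is a dict shaped like those in the metadata structure shown further below:

```python
SEVERITY_ORDER = ["none", "low", "medium", "high"]  # assumed ordering

def aggregate(test_results):
    drift = any(r["drift_detected"] for r in test_results)       # any test → drift
    severity = max((r["severity"] for r in test_results),
                   key=SEVERITY_ORDER.index)                     # maximum severity
    score = sum(r["score"] for r in test_results) / len(test_results)  # average score
    return drift, severity, score

results = [
    {"test": "ks_test", "score": 0.25, "severity": "high", "drift_detected": True},
    {"test": "psi", "score": 0.11, "severity": "low", "drift_detected": False},
]
print(aggregate(results))  # (True, 'high', 0.18)
```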
### Example Output

```python
report = detector.detect_drift("customers")

for drift in report.column_drifts:
    if drift.drift_detected:
        print(f"{drift.column_name}.{drift.metric_name}")
        print(f"  Severity: {drift.drift_severity}")
        print(f"  Tests run: {drift.metadata['tests_run']}")

        # Individual test results
        for test_result in drift.metadata['test_results']:
            print(f"  {test_result['test']}: score={test_result['score']}, "
                  f"p_value={test_result['p_value']}, "
                  f"drift={test_result['drift_detected']}")
```
### Metadata Structure

```python
drift.metadata = {
    'strategy': 'statistical',
    'tests_run': ['ks_test', 'psi', 'z_score'],
    'test_results': [
        {
            'test': 'ks_test',
            'score': 0.25,
            'p_value': 0.001,
            'drift_detected': True,
            'severity': 'high',
            'metadata': {
                'alpha': 0.05,
                'statistic': 0.25,
                'p_value': 0.001
            }
        },
        # ... more test results
    ],
    'aggregated_score': 0.18,
    'sensitivity': 'medium'
}
```
## Best Practices

### 1. Start with Default Configuration

```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests:
      - ks_test
      - psi
      - chi_square
    sensitivity: medium
```
### 2. Enable Histograms

For best results with KS test and PSI:

```yaml
profiling:
  compute_histograms: true
  histogram_bins: 10
```
### 3. Adjust Sensitivity Based on Your Needs

```yaml
# Production: lower sensitivity (fewer false positives)
sensitivity: low

# Development: higher sensitivity (catch more issues)
sensitivity: high
```
### 4. Select Tests Based on Your Data

```yaml
# Numeric-heavy dataset
tests:
  - ks_test
  - psi
  - z_score

# Categorical-heavy dataset
tests:
  - chi_square
  - entropy
  - top_k

# Mixed dataset
tests:
  - ks_test
  - psi
  - chi_square
  - top_k
```
### 5. Tune Test-Specific Parameters

```yaml
test_params:
  # More strict KS test
  ks_test:
    alpha: 0.01
  # More buckets for finer PSI analysis
  psi:
    buckets: 20
    threshold: 0.15
  # Track more top categories
  top_k:
    k: 20
    similarity_threshold: 0.8
```
## Performance Considerations
- Multiple tests: Running more tests takes slightly longer, but tests run in parallel where possible
- Histogram data: Requires more storage but enables better detection
- Large datasets: Statistical tests are efficient and scale well
## Troubleshooting

### "No statistical tests could run"

Problem: None of the configured tests support the column type or metric, or there is insufficient data.
Solutions:
- Check that the column type is numeric or categorical
- Enable histograms: `compute_histograms: true`
- Ensure you have summary statistics (mean, stddev, etc.)
- The system will fall back to threshold-based detection if no tests can run
"All tests fail"
Problem: Data format issues or missing dependencies.
Solutions:
- Check data is in expected format (histograms, distributions)
- Install scipy for better test accuracy:
pip install scipy - Check logs for specific error messages
"Too many false positives"
Problem: Sensitivity too high or thresholds too low.
Solutions:
- Lower sensitivity:
sensitivity: low - Increase test thresholds in
test_params - Remove more sensitive tests (e.g., remove
entropyif too noisy)
"Not detecting obvious drift"
Problem: Sensitivity too low or thresholds too high.
Solutions:
- Increase sensitivity:
sensitivity: high - Lower test thresholds in
test_params - Add more tests to the list
## Comparison with Other Strategies
| Feature | Absolute Threshold | Standard Deviation | Statistical Tests |
|---|---|---|---|
| Ease of Use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Statistical Rigor | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Distribution Awareness | ❌ | ❌ | ✅ |
| Categorical Support | ❌ | ❌ | ✅ |
| Data Requirements | Minimal | Summary stats | Histograms preferred |
| False Positives | Medium | Low | Very Low |
| Setup Complexity | Low | Medium | Medium |
## See Also
- Drift Detection Guide - General drift detection documentation
- Configuration Examples - Example configurations
- Profiling Metrics - Available metrics