Expectation Learning in Baselinr
Baselinr can automatically learn expected metric ranges from historical profiling data. These learned ranges are the basis for outlier detection that does not require explicit thresholds.
Overview
Expectation Learning is a feature that automatically computes expected statistical ranges for metrics based on historical profiling runs. This complements the existing baseline system by providing pre-computed statistical models that can detect anomalies.
Expectations vs Baselines
Baselines (existing system):
- Selected dynamically during drift detection (e.g., "last run", "moving average")
- Single value or simple aggregation used as a reference point
- Purpose: answer "is the current value different from the baseline?"
- Computed on-demand based on drift detection strategy
- Used for detecting changes over time
Learned Expectations (new system):
- Pre-computed and persistently stored statistical models
- Rich statistical properties: expected mean/variance, control limits, learned distributions, categorical frequencies
- Purpose: answer "is this value within the expected normal range?"
- Continuously updated as new profiling runs complete
- Intended for automatic outlier detection without explicit thresholds (the detection step itself is a planned feature)
- Independent of drift detection configuration
Key Difference: Baselines help detect changes ("this value changed from last week"), while expectations help detect anomalies ("this value is outside the 3-sigma normal range"). They complement each other - a value can both drift from baseline AND be within expected range, or vice versa.
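To make the distinction concrete, here is a minimal sketch in which a value drifts from its baseline yet stays inside the expected range. The history values, the "last run" baseline, and the 5% change threshold are illustrative assumptions, not Baselinr defaults.

```python
from statistics import mean, stdev

# Hypothetical history of a metric (e.g., daily row counts); illustrative values only.
history = [1000, 1040, 980, 1020, 960, 1010, 990]
current = 1060

# Baseline check: compare against the last run using an assumed 5% change threshold.
baseline = history[-1]
pct_change = abs(current - baseline) / baseline
drifted = pct_change > 0.05

# Expectation check: compare against 3-sigma control limits computed from history.
mu = mean(history)
sigma = stdev(history)  # sample standard deviation
lcl, ucl = mu - 3 * sigma, mu + 3 * sigma
within_expected = lcl <= current <= ucl

print(f"drifted from baseline: {drifted}")
print(f"within expected range: {within_expected}")
```

Here the value moved about 7% relative to the last run, so a drift check flags it, while the 3-sigma limits still treat it as normal variation.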
Configuration
Expectation learning is opt-in and disabled by default. Enable it in your storage configuration:
storage:
  connection:
    type: postgres
    host: localhost
    database: baselinr_db
  # Expectation learning configuration
  enable_expectation_learning: true
  learning_window_days: 30  # Learn from last 30 days (default: 30)
  min_samples: 5            # Require at least 5 runs (default: 5)
  ewma_lambda: 0.2          # EWMA smoothing parameter (default: 0.2)
Configuration Options
- enable_expectation_learning (bool, default: false) - Enable automatic learning of expected metric ranges
  - Set to true to enable expectation learning
- learning_window_days (int, default: 30) - Historical window in days for learning expectations
  - Only profiling runs within this window are used for learning
  - Longer windows provide more stable expectations but may include outdated patterns
  - Shorter windows adapt faster to changes but may be less reliable
- min_samples (int, default: 5) - Minimum number of historical runs required before learning expectations
  - If fewer runs are available, expectations will not be learned for that metric
  - Lower values allow learning with less history but may be less reliable
  - Recommended: 5-10 for stable expectations
- ewma_lambda (float, default: 0.2) - Exponentially Weighted Moving Average smoothing parameter
  - Range: 0 < lambda ≤ 1
  - Lower values (e.g., 0.1) give more weight to older data (smoother)
  - Higher values (e.g., 0.3) give more weight to recent data (more reactive)
  - Used for computing EWMA-based expectations
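The effect of ewma_lambda is easiest to see on a small series. The sketch below uses the standard EWMA recurrence; the function name and the choice to seed with the first observation are assumptions, and Baselinr's internal implementation may differ.

```python
def ewma(values, lam):
    """Return the EWMA of values (oldest first) with smoothing parameter lam."""
    if not 0 < lam <= 1:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    smoothed = values[0]  # seed with the first observation (an assumption)
    for x in values[1:]:
        # Standard recurrence: new = lam * observation + (1 - lam) * previous
        smoothed = lam * x + (1 - lam) * smoothed
    return smoothed

values = [100, 100, 100, 100, 150]  # the most recent run jumps
smooth = ewma(values, 0.1)    # moves only slightly toward 150
reactive = ewma(values, 0.3)  # moves further toward 150
print(round(smooth, 2), round(reactive, 2))
```

With lambda at 1 the EWMA equals the latest value, which is why smaller values are the more conservative choice.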
How Expectations are Learned
After each profiling run completes, if expectation learning is enabled, Baselinr will:
- Query Historical Data: Retrieve metric values from previous profiling runs within the configured window
- Check Sample Size: Ensure sufficient samples are available (>= min_samples)
- Compute Statistics: Calculate expected mean, variance, standard deviation, min, max
- Compute Control Limits: Calculate Shewhart 3-sigma control limits (mean ± 3σ)
- Learn Distribution: Detect if values follow a normal distribution (heuristic-based)
- Learn Categorical Frequencies: For categorical columns, learn expected frequency distributions
- Store Expectations: Save learned expectations to the baselinr_expectations table
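The sample-size check, statistics, and control-limit steps above can be sketched as follows. This is an illustration only: the function name is hypothetical, the dictionary keys mirror the documented table fields where they exist (expected_variance is an assumed name), and the use of the sample standard deviation is an assumption since the estimator is not specified here.

```python
from statistics import mean, stdev

def learn_expectation(values, min_samples=5):
    """Return learned statistics for one metric, or None if history is too short."""
    if len(values) < min_samples:
        return None  # not enough runs yet; nothing is learned
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumed estimator)
    return {
        "expected_mean": mu,
        "expected_variance": sigma ** 2,  # hypothetical field name
        "expected_stddev": sigma,
        "expected_min": min(values),
        "expected_max": max(values),
        "lower_control_limit": mu - 3 * sigma,  # Shewhart 3-sigma limits
        "upper_control_limit": mu + 3 * sigma,
        "sample_size": len(values),
    }

history = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0]
expectation = learn_expectation(history)
print(expectation)
```

Returning None for short histories keeps the "not learned" case explicit, matching the behavior described for min_samples.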
Supported Metrics
Expectations are learned for numeric metrics:
- mean - Expected mean value
- stddev - Expected standard deviation
- count - Expected row count
- null_ratio - Expected null percentage
- unique_ratio - Expected uniqueness ratio
What Gets Learned
For each metric, the following information is stored:
- Expected Statistics:
  - Mean, variance, standard deviation
  - Min and max observed values
- Control Limits:
  - Lower Control Limit (LCL) and Upper Control Limit (UCL)
  - Typically computed using Shewhart 3-sigma method: mean ± 3 × stddev
- Distribution Information:
  - Distribution type (normal, empirical)
  - Distribution parameters (mean, stddev, etc.)
- Categorical Distributions (for categorical columns):
  - Expected frequency for each category value
- EWMA (if sufficient samples):
  - Exponentially Weighted Moving Average value
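For categorical columns, one plausible way to derive an expected frequency per category is to pool the counts from historical runs and normalize. The sketch below assumes each run is available as a mapping from category value to count; both that input shape and the pooling approach are assumptions rather than a description of Baselinr's internals.

```python
from collections import Counter

def learn_category_distribution(runs):
    """Pool category counts across runs and return relative frequencies."""
    pooled = Counter()
    for counts in runs:
        pooled.update(counts)  # adds counts per category
    total = sum(pooled.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in pooled.items()}

# Hypothetical per-run counts for a "plan" column.
runs = [
    {"free": 70, "pro": 30},
    {"free": 65, "pro": 35},
    {"free": 75, "pro": 20, "enterprise": 5},
]
distribution = learn_category_distribution(runs)
print(distribution)
```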
Automatic Updates
Expectations are automatically updated after each profiling run if:
- Expectation learning is enabled
- Sufficient historical data is available (>= min_samples)
- The metric exists in the current profiling run
Expectations are recalculated from scratch using all available historical data within the window, ensuring they stay current with data patterns.
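The interaction between the window and the sample-size requirement can be sketched like this. The function names, the (timestamp, value) input shape, and the inclusive cutoff are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def values_in_window(runs, now, learning_window_days=30):
    """Keep metric values from runs no older than the learning window.

    runs: list of (run_timestamp, metric_value) tuples.
    """
    cutoff = now - timedelta(days=learning_window_days)
    return [value for ts, value in runs if ts >= cutoff]

def should_learn(runs, now, learning_window_days=30, min_samples=5):
    """True if enough runs fall inside the window to (re)learn expectations."""
    return len(values_in_window(runs, now, learning_window_days)) >= min_samples

now = datetime(2024, 6, 30)
# Hypothetical weekly runs going back about six weeks.
runs = [(now - timedelta(days=d), 100.0 + d) for d in (1, 8, 15, 22, 29, 36, 43)]

print(should_learn(runs, now))                           # 5 runs fall inside 30 days
print(should_learn(runs, now, learning_window_days=14))  # only 2 runs fall inside 14 days
```

Because the statistics are rebuilt from the filtered values on every run, runs that age out of the window stop influencing the expectation.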
Using Expectations for Outlier Detection
(Future feature - expectations are currently learned but not yet used for automatic outlier detection)
In future versions, learned expectations will be used to automatically flag outliers without requiring explicit thresholds. For example:
- Values outside the 3-sigma control limits
- Categorical values with unexpected frequencies
- Values that don't match the learned distribution
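Until built-in detection is available, the first two checks can be approximated by hand against stored expectations. The helpers below are hypothetical, not part of Baselinr, and the 0.10 frequency tolerance is an arbitrary illustrative value.

```python
def is_outlier(value, expectation):
    """True if value falls outside the learned control limits."""
    lcl = expectation["lower_control_limit"]
    ucl = expectation["upper_control_limit"]
    return not (lcl <= value <= ucl)

def unexpected_categories(observed, expected, tolerance=0.10):
    """Return categories whose frequency differs from expected by more than tolerance."""
    flagged = {}
    for category in set(observed) | set(expected):
        diff = abs(observed.get(category, 0.0) - expected.get(category, 0.0))
        if diff > tolerance:
            flagged[category] = diff
    return flagged

expectation = {"lower_control_limit": 6.76, "upper_control_limit": 15.24}
print(is_outlier(11.0, expectation), is_outlier(20.0, expectation))

flagged = unexpected_categories(
    observed={"free": 0.5, "pro": 0.5},
    expected={"free": 0.7, "pro": 0.3},
)
print(sorted(flagged))
```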
Troubleshooting
Expectations Not Being Learned
Symptom: No expectations appear in the database after profiling runs.
Possible Causes:
- Learning disabled: Check that enable_expectation_learning: true is set in the storage config
- Insufficient samples: Not enough historical runs (need >= min_samples)
  - Solution: Wait for more profiling runs, or reduce min_samples
- Window too short: Historical window doesn't contain enough runs
  - Solution: Increase learning_window_days
Check logs: Look for debug messages like:
Insufficient samples for table.column.metric: 3 < 5
Expectations Not Updating
Symptom: Expectations exist but don't change after new profiling runs.
Possible Causes:
- Learning disabled: Check configuration
- Errors during learning: Check logs for warnings
- Metric not in current run: If a metric doesn't appear in the current run, expectations won't update
Check logs: Look for warnings like:
Failed to learn expectations for table.column.metric: ...
Control Limits Seem Wrong
Symptom: Control limits (LCL/UCL) don't match expected ranges.
Possible Causes:
- High variance in historical data: Control limits are computed as mean ± 3σ
  - High variance = wider limits
  - This may be correct if your data is naturally variable
- Limited historical data: With few samples, statistics may be unreliable
  - Solution: Ensure >= 10 samples for more reliable limits
- Non-normal distribution: Shewhart limits assume normal distribution
  - Baselinr detects distribution type, but limits are still computed using standard deviation
Recommendation: Review the expected_stddev and distribution_type in expectations to understand the data characteristics.
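A short worked example shows how skewed history produces limits that look wrong. The null_ratio values below are invented for illustration, and clamping the limits to the 0 to 1 range is a suggestion for interpreting them, not something Baselinr is documented to do.

```python
from statistics import mean, stdev

# Hypothetical null_ratio history: mostly near zero with a single spike.
null_ratios = [0.00, 0.01, 0.00, 0.02, 0.01, 0.00, 0.15, 0.01]

mu = mean(null_ratios)
sigma = stdev(null_ratios)  # sample standard deviation
lcl, ucl = mu - 3 * sigma, mu + 3 * sigma
print(f"mean={mu:.3f} stddev={sigma:.3f} LCL={lcl:.3f} UCL={ucl:.3f}")

# A ratio cannot be negative, so clamp to the valid range when reading the limits.
clamped = (max(lcl, 0.0), min(ucl, 1.0))
print(f"clamped limits: {clamped[0]:.3f} to {clamped[1]:.3f}")
```

Two effects are visible: the lower limit is negative, which is impossible for a ratio, and the spike inflates the standard deviation enough that the spike itself still falls inside the limits.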
Example Configuration
Basic Setup
storage:
  connection:
    type: postgres
    host: localhost
    database: baselinr_db
    username: baselinr
    password: secret
  enable_expectation_learning: true
Advanced Setup
storage:
  connection:
    type: snowflake
    account: myaccount
    database: BASELINR_DB
    warehouse: COMPUTE_WH
  # Expectation learning with custom parameters
  enable_expectation_learning: true
  learning_window_days: 60  # Use 60 days of history
  min_samples: 10           # Require 10 runs before learning
  ewma_lambda: 0.15         # More conservative smoothing
Per-Table Configuration
(Note: Currently learning applies to all tables. Per-table configuration may be added in future versions.)
Database Schema
Expectations are stored in the baselinr_expectations table. Key fields:
- table_name, schema_name, column_name, metric_name - Identifiers
- expected_mean, expected_stddev, expected_min, expected_max - Statistics
- lower_control_limit, upper_control_limit - Control limits
- distribution_type, distribution_params - Distribution information
- category_distribution - Categorical frequencies (JSON)
- sample_size, learning_window_days - Metadata
- last_updated - Last update timestamp
Query expectations:
SELECT * FROM baselinr_expectations
WHERE table_name = 'users'
AND column_name = 'age'
AND metric_name = 'mean';
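The same lookup can be done from Python with a parameterized query. The sketch below uses an in-memory SQLite database so that it runs standalone; the table definition is a reduced, assumed subset of the documented fields with guessed column types, and a real deployment would connect to the configured Postgres or Snowflake backend instead.

```python
import sqlite3

# In-memory stand-in for the storage database (assumed subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE baselinr_expectations (
        table_name TEXT, column_name TEXT, metric_name TEXT,
        expected_mean REAL, expected_stddev REAL,
        lower_control_limit REAL, upper_control_limit REAL,
        sample_size INTEGER
    )
    """
)
# Illustrative row; the values are made up.
conn.execute(
    "INSERT INTO baselinr_expectations VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("users", "age", "mean", 35.0, 1.5, 30.5, 39.5, 12),
)

def fetch_expectation(conn, table, column, metric):
    """Return (mean, stddev, LCL, UCL) for one metric, or None if not learned yet."""
    return conn.execute(
        "SELECT expected_mean, expected_stddev, lower_control_limit, upper_control_limit "
        "FROM baselinr_expectations "
        "WHERE table_name = ? AND column_name = ? AND metric_name = ?",
        (table, column, metric),
    ).fetchone()

row = fetch_expectation(conn, "users", "age", "mean")
print(row)
```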
Migration
To enable expectation learning on an existing Baselinr installation:
- Run migration to create the expectations table:
  baselinr migrate
- Update configuration to enable learning:
  storage:
    enable_expectation_learning: true
- Profile your tables - expectations will be learned automatically after sufficient runs
Best Practices
- Start with defaults: Use default configuration initially, then tune based on your data patterns
- Monitor sample sizes: Ensure you have enough historical data before relying on expectations
- Review control limits: Periodically check if control limits make sense for your data
- Combine with baselines: Use both baselines and expectations for comprehensive monitoring
- Window size: Match learning_window_days to your data update frequency
  - Daily updates: 30 days
  - Weekly updates: 90 days
  - Monthly updates: 180 days
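The window recommendations follow from simple arithmetic: the window has to contain at least min_samples runs. The helper below is a hypothetical back-of-the-envelope check that assumes evenly spaced runs and treats a month as 30 days.

```python
def expected_runs(learning_window_days, days_between_runs):
    """Approximate number of profiling runs that fall inside the learning window."""
    return learning_window_days // days_between_runs

MIN_SAMPLES = 5  # documented default for min_samples

for label, window, interval in [("daily", 30, 1), ("weekly", 90, 7), ("monthly", 180, 30)]:
    runs = expected_runs(window, interval)
    status = "enough" if runs >= MIN_SAMPLES else "too few"
    print(f"{label}: about {runs} runs in a {window}-day window ({status})")

# Weekly runs with the default 30-day window yield only about 4 samples.
print(expected_runs(30, 7))
```

This is why a weekly schedule combined with the default 30-day window falls below the default min_samples and never produces expectations.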
Related Documentation
- Drift Detection Guide - Understanding baselines and drift detection
- Profiling Enrichment - Other enrichment features
- Architecture: Expectation Learning - Technical details