Expectation Learning Architecture

Technical architecture documentation for the expectation learning system in Baselinr.

Overview

The expectation learning system automatically computes expected metric ranges from historical profiling data. It learns statistical properties, control limits, distributions, and categorical frequencies to enable automatic outlier detection without explicit thresholds.

System Design

High-Level Flow

Profiling Run Complete
    ↓
ResultWriter.write_results()
    ↓
_learn_expectations() [if enabled]
    ↓
For each column + numeric metric:
    ↓
ExpectationLearner.learn_expectations()
    ↓
Query historical metrics (window_days)
    ↓
Check sample size >= min_samples
    ↓
Compute statistics, control limits, distributions
    ↓
ExpectationStorage.save_expectation()
    ↓
Store in baselinr_expectations table

Components

1. ExpectationLearner (baselinr/learning/expectation_learner.py)

Main learning engine that computes expectations from historical data.

Key Methods:

  • learn_expectations() - Main entry point, orchestrates learning
  • _get_historical_metrics() - Queries historical metric values from database
  • _compute_expected_statistics() - Calculates mean, variance, stddev, min, max
  • _compute_control_limits() - Computes Shewhart 3-sigma limits
  • _compute_ewma() - Calculates Exponentially Weighted Moving Average
  • _learn_distribution() - Detects distribution type (normal vs empirical)
  • _learn_categorical_distribution() - Learns category frequencies

Design Decisions:

  • Uses Python's statistics module for calculations (reliable, standard library); see the sketch after this list
  • Requires minimum 5 samples by default (configurable)
  • Uses 30-day window by default (configurable)
  • Skips learning gracefully if insufficient data (logs debug, continues)
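
A minimal sketch of that gate-then-compute flow, using only the standard library (the function name and signature here are illustrative, not the actual method):

import statistics

def learn_expectations(values, min_samples=5):
    # Gate on sample size, then compute summary statistics.
    # `values` stands in for the historical metric values returned
    # by _get_historical_metrics().
    if len(values) < min_samples:
        return None  # insufficient data: log at debug and skip
    return {
        "expected_mean": statistics.mean(values),
        "expected_variance": statistics.variance(values),
        "expected_stddev": statistics.stdev(values),
        "expected_min": min(values),
        "expected_max": max(values),
    }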

2. ExpectationStorage (baselinr/learning/expectation_storage.py)

Persistence layer for learned expectations.

Key Methods:

  • save_expectation() - Upsert logic (insert or update)
  • get_expectation() - Retrieve by table/column/metric
  • _insert_expectation(), _update_expectation() - SQL execution
  • _expectation_to_params(), _row_to_expectation() - Serialization

Design Decisions:

  • Uses upsert pattern (check existence, then insert or update); see the sketch below
  • Updates increment expectation_version for audit trail
  • Stores JSON fields (distribution_params, category_distribution) as TEXT
  • Handles NULL schema_name gracefully
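
A minimal sketch of the upsert pattern against a DB-API-style connection (the SQL and parameter style are illustrative; the real storage layer handles dialects and many more columns):

import json

def save_expectation(conn, exp):
    # Check existence first, keyed on the unique constraint;
    # the OR clause handles NULL schema_name
    row = conn.execute(
        "SELECT id, expectation_version FROM baselinr_expectations "
        "WHERE table_name = ? AND column_name = ? AND metric_name = ? "
        "AND (schema_name = ? OR (schema_name IS NULL AND ? IS NULL))",
        (exp["table_name"], exp["column_name"], exp["metric_name"],
         exp.get("schema_name"), exp.get("schema_name")),
    ).fetchone()

    params_json = json.dumps(exp.get("distribution_params") or {})
    if row is None:
        conn.execute(
            "INSERT INTO baselinr_expectations "
            "(table_name, schema_name, column_name, metric_name, "
            "distribution_params, last_updated) "
            "VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)",
            (exp["table_name"], exp.get("schema_name"),
             exp["column_name"], exp["metric_name"], params_json),
        )
    else:
        # Update in place and bump expectation_version for the audit trail
        conn.execute(
            "UPDATE baselinr_expectations "
            "SET distribution_params = ?, expectation_version = ?, "
            "last_updated = CURRENT_TIMESTAMP WHERE id = ?",
            (params_json, row[1] + 1, row[0]),
        )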

3. ResultWriter Integration (baselinr/storage/writer.py)

Integration point where learning is triggered after profiling.

Integration:

  • Called after enrichment metrics are calculated
  • Only executes if config.enable_expectation_learning is True
  • Iterates through columns and numeric metrics
  • Handles errors gracefully (logs warning, continues)
  • Doesn't block profiling completion if learning fails
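
The integration loop has roughly this shape (the attribute names on result, self.learner, and self.storage are assumptions, not Baselinr's exact data model):

import logging

logger = logging.getLogger(__name__)

def _learn_expectations(self, result):
    for column in result.columns:
        for metric_name in column.numeric_metrics:
            try:
                expectation = self.learner.learn_expectations(
                    table_name=result.table_name,
                    column_name=column.name,
                    metric_name=metric_name,
                )
                if expectation is not None:
                    self.storage.save_expectation(expectation)
            except Exception:
                # Learning failures must never block profiling completion
                logger.warning(
                    "Expectation learning failed for %s.%s (%s); continuing",
                    result.table_name, column.name, metric_name,
                    exc_info=True,
                )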

Database Schema

Table: baselinr_expectations

CREATE TABLE baselinr_expectations (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    table_name VARCHAR(255) NOT NULL,
    schema_name VARCHAR(255),
    column_name VARCHAR(255) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    column_type VARCHAR(100),

    -- Expected statistics
    expected_mean FLOAT,
    expected_variance FLOAT,
    expected_stddev FLOAT,
    expected_min FLOAT,
    expected_max FLOAT,

    -- Control limits
    lower_control_limit FLOAT,
    upper_control_limit FLOAT,
    lcl_method VARCHAR(50), -- 'shewhart', 'ewma', etc.
    ucl_method VARCHAR(50),

    -- EWMA
    ewma_value FLOAT,
    ewma_lambda FLOAT DEFAULT 0.2,

    -- Distribution
    distribution_type VARCHAR(50), -- 'normal', 'empirical', etc.
    distribution_params TEXT, -- JSON of distribution parameters

    -- Categorical
    category_distribution TEXT, -- JSON: {"value": frequency, ...}

    -- Learning metadata
    sample_size INTEGER,
    learning_window_days INTEGER,
    last_updated TIMESTAMP NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expectation_version INTEGER DEFAULT 1,

    UNIQUE KEY unique_expectation (table_name, schema_name, column_name, metric_name),
    INDEX idx_table_column (table_name, schema_name, column_name),
    INDEX idx_last_updated (last_updated DESC)
);

Schema Design Decisions

  1. Unique Constraint: One expectation per table/column/metric combination
  2. Version Tracking: expectation_version increments on each update for audit trail
  3. JSON Storage: Distribution params and category distributions stored as TEXT (JSON)
    • Allows flexibility for different distribution types
    • Easy to query but requires JSON parsing (see the round-trip sketch below)
  4. Indexes: Optimized for lookup by table/column and by update time
  5. Nullable Fields: Many fields nullable to handle partial learning (e.g., no categorical data)
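
For example, the JSON fields round-trip through the standard library on write and read (a minimal sketch; the values match the normal-distribution example later in this document):

import json

# Serialize distribution parameters to TEXT on write...
params_text = json.dumps({"mean": 42.5, "stddev": 3.2})

# ...and parse them back when reading an expectation row
params = json.loads(params_text)
assert params["stddev"] == 3.2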

Control Limits

Shewhart Control Limits

Method: 3-sigma limits (mean ± 3 × standard deviation)

Formula:

LCL = mean - (3 × stddev)
UCL = mean + (3 × stddev)

When to Use: The default method; works well for normally distributed data
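
A direct translation of the formula (a sketch, not the actual _compute_control_limits() implementation):

import statistics

def shewhart_limits(values, sigma=3.0):
    # LCL = mean - sigma * stddev, UCL = mean + sigma * stddev
    mean = statistics.mean(values)
    stddev = statistics.stdev(values)
    return mean - sigma * stddev, mean + sigma * stddev

# e.g. a window with mean 42.5 and stddev 3.2 yields limits (32.9, 52.1)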

Limitations:

  • Assumes normal distribution
  • May be too wide or too narrow for skewed data
  • Doesn't account for trends

EWMA (Exponentially Weighted Moving Average)

Method: Smooths historical values with exponential weighting

Formula:

EWMA(t) = λ × value(t) + (1 - λ) × EWMA(t-1)

Where λ (lambda) is the smoothing parameter (default: 0.2)

Control Limits: Computed from the asymptotic variance of the EWMA statistic

When to Use: Better for detecting small, gradual shifts

Current Status: EWMA value is computed but not yet used for control limits (planned enhancement)
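
A minimal sketch of the recurrence (seeding with the first observation is an assumption; the actual _compute_ewma() may differ):

def compute_ewma(values, lam=0.2):
    # EWMA(t) = lam * value(t) + (1 - lam) * EWMA(t-1);
    # lam defaults to 0.2, matching ewma_lambda in the schema above
    smoothed = values[0]
    for value in values[1:]:
        smoothed = lam * value + (1 - lam) * smoothed
    return smoothed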

Distribution Learning

Normal Distribution Detection

Heuristic:

  1. Calculate skewness (standardized third moment)
  2. Compare mean and median
  3. If |skewness| < 0.5 AND |mean - median| / mean < 0.2: classify as normal

Distribution Parameters (normal):

{
    "mean": 42.5,
    "stddev": 3.2
}

Empirical Distribution

When: The data doesn't fit a normal distribution, or there is too little data to classify it

Distribution Parameters (empirical):

{
    "mean": 42.5,
    "stddev": 3.2,
    "min": 35.0,
    "max": 50.0,
    "skewness": 1.2
}
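
A sketch combining the heuristic with both parameter payloads; the 0.5 and 0.2 thresholds come from the text above, while the zero guards and the exact skewness estimator are illustrative:

import statistics

def classify_distribution(values):
    mean = statistics.mean(values)
    median = statistics.median(values)
    stddev = statistics.stdev(values)
    empirical = {"mean": mean, "stddev": stddev,
                 "min": min(values), "max": max(values)}
    if stddev == 0 or mean == 0:
        return "empirical", {**empirical, "skewness": 0.0}
    # Sample skewness: mean of standardized cubed deviations
    skewness = sum(((v - mean) / stddev) ** 3 for v in values) / len(values)
    if abs(skewness) < 0.5 and abs(mean - median) / abs(mean) < 0.2:
        return "normal", {"mean": mean, "stddev": stddev}
    return "empirical", {**empirical, "skewness": skewness}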

Future Enhancements:

  • Support for other distributions (exponential, uniform, etc.)
  • Statistical tests (Shapiro-Wilk, Anderson-Darling) for normality

Categorical Distribution Learning

Process

  1. Query historical runs for top_values or category_distribution metrics
  2. Aggregate category counts across runs
  3. Normalize to frequencies: frequency(category) = count(category) / total_count
  4. Store as JSON: {"category1": 0.45, "category2": 0.30, ...}

Example

{
    "active": 0.65,
    "inactive": 0.25,
    "pending": 0.10
}
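
A sketch of the aggregate-and-normalize steps that would produce exactly this payload (the input shape, one {category: count} dict per run, is an assumption):

from collections import Counter

def learn_category_distribution(runs):
    # Step 2: aggregate category counts across historical runs
    totals = Counter()
    for counts in runs:
        totals.update(counts)
    # Step 3: normalize counts to frequencies
    grand_total = sum(totals.values())
    return {category: count / grand_total
            for category, count in totals.items()}

# learn_category_distribution([{"active": 13, "inactive": 5, "pending": 2}])
# -> {"active": 0.65, "inactive": 0.25, "pending": 0.1}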

Use Case: Detect unexpected category values or frequency shifts

Limitations:

  • Requires top_values or category_distribution metric to be computed during profiling
  • Only learns from historical data; categories introduced later are unknown until expectations are re-learned

Integration Points

1. Profiling Workflow

Location: baselinr/storage/writer.py::write_results()

Trigger: After enrichment metrics are calculated

Flow:

if config.enable_expectation_learning:
    self._learn_expectations(result)

2. Configuration

Location: baselinr/config/schema.py::StorageConfig

Options:

  • enable_expectation_learning: bool
  • learning_window_days: int
  • min_samples: int
  • ewma_lambda: float
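
A dataclass sketch of these options; the real StorageConfig in baselinr/config/schema.py may differ, and the opt-in default shown for enable_expectation_learning is an assumption (the other three defaults are the values quoted in this document):

from dataclasses import dataclass

@dataclass
class StorageConfig:
    enable_expectation_learning: bool = False  # assumed opt-in
    learning_window_days: int = 30
    min_samples: int = 5
    ewma_lambda: float = 0.2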

3. Database Migration

Location: baselinr/storage/migrations/versions/v3_expectations.py

Migration: Creates baselinr_expectations table

Version: Schema version 3

Dialect Support: Generic SQL and Snowflake-specific DDL

Performance Considerations

Query Performance

Historical Metrics Query:

  • Joins baselinr_results with baselinr_runs
  • Filters by table, column, metric, date window
  • Orders by timestamp
  • Optimization: Indexes on (dataset_name, column_name, metric_name) and profiled_at
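
The query has roughly this shape (the join key, the metric_value column, and the bind-parameter style are assumptions):

# Illustrative shape of the historical metrics query
HISTORICAL_METRICS_SQL = """
SELECT res.metric_value, runs.profiled_at
FROM baselinr_results AS res
JOIN baselinr_runs AS runs ON runs.run_id = res.run_id
WHERE res.dataset_name = :table_name
  AND res.column_name = :column_name
  AND res.metric_name = :metric_name
  AND runs.profiled_at >= :window_start
ORDER BY runs.profiled_at
"""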

Impact:

  • Runs once per column/metric combination after each profiling run
  • Minimal impact if profiled tables have few columns/metrics
  • Scales with number of tables × columns × metrics

Storage Impact

Per Expectation: ~500 bytes (with JSON fields)

Estimation:

  • 100 tables × 10 columns × 5 metrics = 5,000 expectations
  • Total storage: ~2.5 MB (negligible)

Growth: Linear with number of table/column/metric combinations

Learning Time

Per Expectation: ~10-50ms (database query + computation)

Total Time:

  • 100 tables × 10 columns × 5 metrics = 5,000 expectations
  • Total time: ~50-250 seconds (parallelization would reduce this but is not currently implemented)

Note: Learning runs after profiling, so it doesn't block user-facing operations

Error Handling

Graceful Degradation

  1. Insufficient Samples: Logs debug, returns None, continues
  2. Query Errors: Logs warning, skips that metric, continues
  3. Computation Errors: Logs warning, skips that metric, continues
  4. Storage Errors: Logs warning, continues with other metrics

Design Principle: Learning failures should never break profiling or drift detection

Logging

  • Debug: Insufficient samples, skipped learning
  • Warning: Errors during learning (query, computation, storage)
  • Info: Not used (to avoid noise)

Future Enhancements

  1. EWMA Control Limits: Use EWMA for control limits instead of just Shewhart
  2. Distribution Tests: Add statistical tests for normality (Shapiro-Wilk, etc.)
  3. Parallel Learning: Learn expectations in parallel for multiple metrics
  4. Incremental Updates: Update expectations incrementally instead of full recomputation
  5. Outlier Detection: Automatic outlier detection using learned expectations
  6. Adaptive Windows: Automatically adjust window size based on data frequency
  7. Multiple Distribution Types: Support exponential, uniform, log-normal, etc.
  8. Time-Series Aware: Account for trends, seasonality in expectations
  9. Per-Table Configuration: Allow different learning parameters per table

Testing Strategy

Unit Tests

  • ExpectationLearner:

    • Test statistics computation with known values
    • Test control limits calculation
    • Test distribution detection heuristics
    • Test insufficient samples handling
  • ExpectationStorage:

    • Test insert/update/retrieve operations
    • Test serialization/deserialization
    • Test NULL handling

Integration Tests

  • End-to-end: Profile → Learn → Store → Retrieve
  • Multiple runs: Ensure expectations update correctly
  • Error scenarios: Test graceful degradation

Migration Tests

  • Verify v3 migration creates table correctly
  • Test both generic SQL and Snowflake dialects