# Baselinr Configuration Reference
Complete reference for all Baselinr configuration options with detailed explanations and examples.
## Table of Contents
- Configuration Overview
- Basic Configuration
- Source Configuration
- Storage Configuration
- Profiling Configuration
- Drift Detection Configuration
- Hooks Configuration
- Monitoring Configuration
- Retry Configuration
- Execution Configuration
- Incremental Configuration
- Schema Change Configuration
- Full Configuration Example
## Configuration Overview

Baselinr configuration is defined in YAML or JSON format. All configuration files start with:

```yaml
environment: development  # or test, production
source: {...}             # Source database connection
storage: {...}            # Storage configuration
```
Most other sections are optional and have sensible defaults.
## Basic Configuration

### environment

Environment name for this configuration.

- **Type:** `string`
- **Values:** `development`, `test`, `production`
- **Default:** `development`

Example:

```yaml
environment: production
```
## Source Configuration

### source

Source database connection configuration.

- **Type:** `ConnectionConfig`
- **Required:** Yes

Example:

```yaml
source:
  type: postgres
  host: localhost
  port: 5432
  database: my_database
  username: my_user
  password: my_password
  schema: public
```
### Connection Types

#### PostgreSQL

```yaml
source:
  type: postgres
  host: localhost
  port: 5432
  database: my_database
  username: my_user
  password: my_password
  schema: public  # Optional
```

#### Snowflake

```yaml
source:
  type: snowflake
  account: myaccount
  warehouse: compute_wh
  database: my_database
  schema: my_schema
  username: my_user
  password: my_password
  role: my_role  # Optional
```

#### SQLite

```yaml
source:
  type: sqlite
  filepath: ./database.db
```

#### MySQL

```yaml
source:
  type: mysql
  host: localhost
  port: 3306
  database: my_database
  username: my_user
  password: my_password
```

#### BigQuery

```yaml
source:
  type: bigquery
  database: my_project.my_dataset
  extra_params:
    credentials_path: /path/to/key.json
```

#### Redshift

```yaml
source:
  type: redshift
  host: my-cluster.xxxxx.us-east-1.redshift.amazonaws.com
  port: 5439
  database: my_database
  username: my_user
  password: my_password
```
Fields:

- `type` (str, required): Database type (`postgres`, `snowflake`, `sqlite`, `mysql`, `bigquery`, `redshift`)
- `host` (Optional[str]): Database host (not required for SQLite)
- `port` (Optional[int]): Database port
- `database` (str, required): Database name
- `username` (Optional[str]): Username
- `password` (Optional[str]): Password
- `schema` (Optional[str]): Schema name (alias: `schema_`)
- `account` (Optional[str]): Snowflake account
- `warehouse` (Optional[str]): Snowflake warehouse
- `role` (Optional[str]): Snowflake role
- `filepath` (Optional[str]): SQLite file path
- `extra_params` (Dict[str, Any]): Additional connection parameters
## Storage Configuration

### storage

Storage configuration for profiling results.

- **Type:** `StorageConfig`
- **Required:** Yes

Example:

```yaml
storage:
  connection:
    type: postgres
    host: localhost
    port: 5432
    database: baselinr_metadata
    username: baselinr
    password: password
  results_table: baselinr_results
  runs_table: baselinr_runs
  create_tables: true
```
Fields:

- `connection` (ConnectionConfig, required): Storage database connection
- `results_table` (str): Results table name (default: `baselinr_results`)
- `runs_table` (str): Runs table name (default: `baselinr_runs`)
- `create_tables` (bool): Automatically create tables (default: `true`)
- `enable_expectation_learning` (bool): Enable expectation learning (default: `false`; see the sketch after this list)
- `learning_window_days` (int): Historical window for learning (default: `30`)
- `min_samples` (int): Minimum runs for learning (default: `5`)
- `ewma_lambda` (float): EWMA smoothing parameter (default: `0.2`)
- `enable_anomaly_detection` (bool): Enable anomaly detection (default: `false`)
- `anomaly_enabled_methods` (List[str]): Enabled anomaly methods (default: all methods)
- `anomaly_iqr_threshold` (float): IQR threshold (default: `1.5`)
- `anomaly_mad_threshold` (float): MAD threshold (default: `3.0`)
- `anomaly_ewma_deviation_threshold` (float): EWMA deviation threshold (default: `2.0`)
- `anomaly_seasonality_enabled` (bool): Enable seasonality detection (default: `true`)
- `anomaly_regime_shift_enabled` (bool): Enable regime shift detection (default: `true`)
- `anomaly_regime_shift_window` (int): Regime shift window size (default: `3`)
- `anomaly_regime_shift_sensitivity` (float): Regime shift p-value threshold (default: `0.05`)
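The expectation-learning and anomaly-detection fields are not shown in the example above. A sketch that enables both, using the documented defaults for the tuning parameters:

```yaml
storage:
  connection:
    type: postgres
    host: localhost
    port: 5432
    database: baselinr_metadata
    username: baselinr
    password: password
  # Learn expected metric ranges from history (defaults shown)
  enable_expectation_learning: true
  learning_window_days: 30
  min_samples: 5
  ewma_lambda: 0.2
  # Flag anomalous metric values (thresholds shown at their defaults)
  enable_anomaly_detection: true
  anomaly_iqr_threshold: 1.5
  anomaly_mad_threshold: 3.0
  anomaly_regime_shift_window: 3
```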
## Profiling Configuration

### profiling

Profiling behavior configuration.

- **Type:** `ProfilingConfig`
- **Required:** No (has defaults)

Example:

```yaml
profiling:
  tables:
    - table: customers
      schema: public
      partition:
        key: order_date
        strategy: latest
      sampling:
        enabled: true
        method: random
        fraction: 0.01
        max_rows: 1000000
  max_distinct_values: 1000
  compute_histograms: true
  histogram_bins: 10
  metrics:
    - count
    - null_count
    - null_ratio
    - distinct_count
    - mean
    - stddev
    - histogram
```
Fields:

- `tables` (List[TablePattern]): List of tables to profile (default: `[]`)
- `max_distinct_values` (int): Maximum distinct values to compute (default: `1000`)
- `compute_histograms` (bool): Compute histograms (default: `true`)
- `histogram_bins` (int): Number of histogram bins (default: `10`)
- `metrics` (List[str]): Metrics to compute (default: all standard metrics)
- `default_sample_ratio` (float): Default sampling ratio (default: `1.0`)
- `enable_enrichment` (bool): Enable profiling enrichment (default: `true`; see the sketch after this list)
- `enable_approx_distinct` (bool): Enable approximate distinct count (default: `true`)
- `enable_schema_tracking` (bool): Enable schema change tracking (default: `true`)
- `enable_type_inference` (bool): Enable data type inference (default: `true`)
- `enable_column_stability` (bool): Enable column stability tracking (default: `true`)
- `stability_window` (int): Stability calculation window (default: `7`)
- `type_inference_sample_size` (int): Type inference sample size (default: `1000`)
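The enrichment flags are not exercised in the example above. A sketch that keeps schema and stability tracking but turns off the heavier enrichment steps; the particular on/off choices are illustrative:

```yaml
profiling:
  enable_enrichment: true
  enable_approx_distinct: false   # skip approximate distinct counts
  enable_type_inference: false    # skip data type inference
  enable_schema_tracking: true
  enable_column_stability: true
  stability_window: 7             # documented default
```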
### TablePattern

Configuration for a single table.

Fields:

- `table` (str, required): Table name
- `schema` (Optional[str]): Schema name (alias: `schema_`)
- `partition` (Optional[PartitionConfig]): Partition configuration
- `sampling` (Optional[SamplingConfig]): Sampling configuration
### PartitionConfig

Partition-aware profiling configuration.

Fields:

- `key` (Optional[str]): Partition column name
- `strategy` (str): Partition strategy (default: `all`; see the sketch after this list)
  - `all`: Profile all partitions
  - `latest`: Profile latest partition
  - `recent_n`: Profile N recent partitions
  - `sample`: Sample partitions
  - `specific_values`: Profile specific partition values
- `recent_n` (Optional[int]): Number of recent partitions (required for `recent_n` strategy)
- `values` (Optional[List[Any]]): Specific partition values (required for `specific_values` strategy)
- `metadata_fallback` (bool): Try to infer partition key (default: `true`)
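The non-default strategies require their companion fields. A sketch of `recent_n` and `specific_values`; the table names, keys, and values are illustrative:

```yaml
profiling:
  tables:
    - table: orders
      partition:
        key: order_date
        strategy: recent_n
        recent_n: 7            # profile the 7 most recent partitions
    - table: events
      partition:
        key: region
        strategy: specific_values
        values: ["us-east", "eu-west"]   # profile only these partitions
```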
### SamplingConfig

Sampling configuration for profiling.

Fields:

- `enabled` (bool): Enable sampling (default: `false`)
- `method` (str): Sampling method (default: `random`)
  - `random`: Random sampling
  - `stratified`: Stratified sampling
  - `topk`: Top-K sampling
- `fraction` (float): Fraction of rows to sample (default: `0.01`)
- `max_rows` (Optional[int]): Maximum rows to sample
Available Metrics:

- `count`: Row count
- `null_count`: Number of null values
- `null_ratio`: Ratio of null values
- `distinct_count`: Number of distinct values
- `unique_ratio`: Ratio of unique values
- `approx_distinct_count`: Approximate distinct count
- `min`: Minimum value
- `max`: Maximum value
- `mean`: Mean value
- `stddev`: Standard deviation
- `histogram`: Value distribution histogram
- `data_type_inferred`: Inferred data type
## Drift Detection Configuration

### drift_detection

Drift detection configuration.

- **Type:** `DriftDetectionConfig`
- **Required:** No (has defaults)

Example:

```yaml
drift_detection:
  strategy: absolute_threshold
  absolute_threshold:
    low_threshold: 5.0
    medium_threshold: 15.0
    high_threshold: 30.0
  baselines:
    strategy: auto
    windows:
      moving_average: 7
      prior_period: 7
    min_runs: 3
  enable_type_specific_thresholds: true
  type_specific_thresholds:
    numeric:
      mean:
        low: 10.0
        medium: 25.0
        high: 50.0
      default:
        low: 5.0
        medium: 15.0
        high: 30.0
```
Fields:

- `strategy` (str): Drift detection strategy (default: `absolute_threshold`)
  - `absolute_threshold`: Percentage change thresholds
  - `standard_deviation`: Standard deviation based
  - `statistical`: Statistical tests (KS, PSI, etc.)
  - `ml_based`: Machine learning based (placeholder)
- `absolute_threshold` (Dict[str, float]): Absolute threshold parameters
  - `low_threshold`: Low severity threshold (default: `5.0`)
  - `medium_threshold`: Medium severity threshold (default: `15.0`)
  - `high_threshold`: High severity threshold (default: `30.0`)
- `standard_deviation` (Dict[str, float]): Standard deviation parameters
  - `low_threshold`: Low severity threshold in std devs (default: `1.0`)
  - `medium_threshold`: Medium severity threshold (default: `2.0`)
  - `high_threshold`: High severity threshold (default: `3.0`)
- `statistical` (Dict[str, Any]): Statistical test parameters (see the sketch after this list)
  - `tests`: List of tests to run (default: `["ks_test", "psi", "chi_square"]`)
  - `sensitivity`: Sensitivity level (default: `medium`)
  - `test_params`: Test-specific parameters
- `baselines` (Dict[str, Any]): Baseline selection configuration
  - `strategy`: Baseline strategy (default: `last_run`)
    - `auto`: Auto-select best baseline
    - `last_run`: Use last run
    - `moving_average`: Use moving average
    - `prior_period`: Use prior period
    - `stable_window`: Use stable window
  - `windows`: Window configuration
    - `moving_average`: Number of runs for moving average (default: `7`)
    - `prior_period`: Days for prior period (default: `7`)
  - `min_runs`: Minimum runs required (default: `3`)
- `enable_type_specific_thresholds` (bool): Enable type-specific thresholds (default: `true`)
- `type_specific_thresholds` (Dict[str, Dict[str, Dict[str, float]]]): Type-specific threshold overrides
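The `statistical` strategy reads its parameters from the `statistical` block. A sketch that runs only the KS and PSI tests against a moving-average baseline; the `high` sensitivity level is an assumed value (the documented default is `medium`):

```yaml
drift_detection:
  strategy: statistical
  statistical:
    tests: ["ks_test", "psi"]   # subset of the documented default tests
    sensitivity: high           # assumed level; default is medium
  baselines:
    strategy: moving_average
    windows:
      moving_average: 14        # compare against the last 14 runs
```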
## Hooks Configuration

### hooks

Event hooks configuration.

- **Type:** `HooksConfig`
- **Required:** No (has defaults)

Example:

```yaml
hooks:
  enabled: true
  hooks:
    - type: logging
      enabled: true
      log_level: INFO
    - type: slack
      enabled: true
      webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
      channel: "#data-alerts"
      min_severity: medium
      alert_on_drift: true
      alert_on_schema_change: true
```
Fields:

- `enabled` (bool): Master switch for all hooks (default: `true`)
- `hooks` (List[HookConfig]): List of hook configurations

### HookConfig

Configuration for a single hook.

Fields:

- `type` (str, required): Hook type
  - `logging`: Log events
  - `sql`: Store events in SQL database
  - `snowflake`: Store events in Snowflake
  - `slack`: Send Slack notifications
  - `custom`: Custom hook (see the sketch after this list)
- `enabled` (bool): Enable this hook (default: `true`)
- `log_level` (Optional[str]): Log level for logging hook (default: `INFO`)
- `connection` (Optional[ConnectionConfig]): Database connection for SQL/Snowflake hooks
- `table_name` (Optional[str]): Table name for SQL/Snowflake hooks (default: `baselinr_events`)
- `webhook_url` (Optional[str]): Webhook URL for Slack hook
- `channel` (Optional[str]): Slack channel
- `username` (Optional[str]): Slack username (default: `Baselinr`)
- `min_severity` (Optional[str]): Minimum severity to alert (default: `low`)
- `alert_on_drift` (Optional[bool]): Alert on drift events (default: `true`)
- `alert_on_schema_change` (Optional[bool]): Alert on schema changes (default: `true`)
- `alert_on_profiling_failure` (Optional[bool]): Alert on profiling failures (default: `true`)
- `timeout` (Optional[int]): Request timeout in seconds (default: `10`)
- `module` (Optional[str]): Module path for custom hook
- `class_name` (Optional[str]): Class name for custom hook
- `params` (Dict[str, Any]): Additional parameters for custom hook
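The same list can also mix storage-backed and custom hooks. A sketch; the `module`, `class_name`, and `params` values are hypothetical:

```yaml
hooks:
  enabled: true
  hooks:
    - type: sql                    # persist events to a database table
      enabled: true
      connection:
        type: postgres
        host: localhost
        port: 5432
        database: baselinr_metadata
        username: baselinr
        password: password
      table_name: baselinr_events  # documented default
    - type: custom                 # dispatch events to your own class
      enabled: true
      module: my_project.hooks     # hypothetical module path
      class_name: PagerDutyHook    # hypothetical hook class
      params:
        routing_key: YOUR_ROUTING_KEY
```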
## Monitoring Configuration

### monitoring

Prometheus metrics configuration.

- **Type:** `MonitoringConfig`
- **Required:** No (has defaults)

Example:

```yaml
monitoring:
  enable_metrics: true
  port: 9753
  keep_alive: true
```

Fields:

- `enable_metrics` (bool): Enable Prometheus metrics (default: `false`; see the scrape sketch after this list)
- `port` (int): Metrics server port (default: `9753`)
- `keep_alive` (bool): Keep server running after profiling (default: `true`)
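These metrics are exposed for Prometheus to scrape. A minimal scrape job for a standard Prometheus setup might look like this; the job name and host are assumptions, and only the port comes from the config above:

```yaml
# prometheus.yml fragment (hypothetical job name and target host)
scrape_configs:
  - job_name: "baselinr"
    static_configs:
      - targets: ["localhost:9753"]   # Baselinr metrics port from the example
```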
## Retry Configuration

### retry

Retry and recovery configuration.

- **Type:** `RetryConfig`
- **Required:** No (has defaults)

Example:

```yaml
retry:
  enabled: true
  retries: 3
  backoff_strategy: exponential
  min_backoff: 0.5
  max_backoff: 8.0
```

Fields:

- `enabled` (bool): Enable retry logic (default: `true`)
- `retries` (int): Maximum retry attempts (default: `3`, range: 0-10)
- `backoff_strategy` (str): Backoff strategy (default: `exponential`)
  - `exponential`: Exponential backoff
  - `fixed`: Fixed backoff (see the sketch after this list)
- `min_backoff` (float): Minimum backoff in seconds (default: `0.5`)
- `max_backoff` (float): Maximum backoff in seconds (default: `8.0`)
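For a constant delay between attempts, switch the strategy to `fixed`. A sketch with illustrative values; how the fixed delay relates to the backoff bounds is not documented here, so both bounds are set to the same value:

```yaml
retry:
  enabled: true
  retries: 5                 # illustrative; documented range is 0-10
  backoff_strategy: fixed
  min_backoff: 2.0           # assumed: fixed delay taken from the backoff bounds
  max_backoff: 2.0
```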
## Execution Configuration

### execution

Parallel execution configuration.

- **Type:** `ExecutionConfig`
- **Required:** No (has defaults)

Example:

```yaml
execution:
  max_workers: 4
  batch_size: 10
  queue_size: 100
  warehouse_limits:
    snowflake: 20
    postgres: 8
    sqlite: 1
```

Fields:

- `max_workers` (int): Maximum parallel workers (default: `1`, sequential)
- `batch_size` (int): Tables per batch (default: `10`)
- `queue_size` (int): Maximum queue size (default: `100`)
- `warehouse_limits` (Dict[str, int]): Warehouse-specific worker limits

**Note:** The default is sequential execution (`max_workers: 1`) for backward compatibility. Set `max_workers` greater than 1 to enable parallelism.
## Incremental Configuration

### incremental

Incremental profiling configuration.

- **Type:** `IncrementalConfig`
- **Required:** No (has defaults)

Example:

```yaml
incremental:
  enabled: true
  change_detection:
    enabled: true
    metadata_table: baselinr_table_state
    snapshot_ttl_minutes: 1440
  partial_profiling:
    enabled: true
    allow_partition_pruning: true
    max_partitions_per_run: 64
  adaptive_scheduling:
    enabled: true
    default_interval_minutes: 1440
  cost_controls:
    enabled: true
    max_bytes_scanned: 1000000000
    fallback_strategy: sample
```

Fields:

- `enabled` (bool): Enable incremental profiling (default: `false`)
- `change_detection`: Change detection configuration
  - `enabled` (bool): Enable change detection (default: `true`)
  - `metadata_table` (str): Metadata cache table (default: `baselinr_table_state`)
  - `snapshot_ttl_minutes` (int): Cache TTL in minutes (default: `1440`)
- `partial_profiling`: Partial profiling configuration
  - `enabled` (bool): Enable partial profiling (default: `true`)
  - `allow_partition_pruning` (bool): Allow partition pruning (default: `true`)
  - `max_partitions_per_run` (int): Maximum partitions per run (default: `64`)
- `adaptive_scheduling`: Adaptive scheduling configuration
  - `enabled` (bool): Enable adaptive scheduling (default: `true`)
  - `default_interval_minutes` (int): Default interval in minutes (default: `1440`)
- `cost_controls`: Cost control configuration
  - `enabled` (bool): Enable cost controls (default: `true`)
  - `max_bytes_scanned` (Optional[int]): Maximum bytes scanned per run
  - `max_rows_scanned` (Optional[int]): Maximum rows scanned per run
  - `fallback_strategy` (str): Fallback strategy (default: `sample`)
## Schema Change Configuration

### schema_change

Schema change detection configuration.

- **Type:** `SchemaChangeConfig`
- **Required:** No (has defaults)

Example:

```yaml
schema_change:
  enabled: true
  similarity_threshold: 0.7
  suppression:
    - table: staging_table
      change_type: column_added
```

Fields:

- `enabled` (bool): Enable schema change detection (default: `true`)
- `similarity_threshold` (float): Similarity threshold for rename detection (default: `0.7`)
- `suppression` (List[SchemaChangeSuppressionRule]): Suppression rules
## Full Configuration Example

See `examples/config.yml` for a complete configuration example with all options.
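For orientation, a minimal end-to-end configuration combining the required sections above might look like the following; the connection details and table list are placeholders:

```yaml
environment: development

source:                      # database to profile
  type: postgres
  host: localhost
  port: 5432
  database: my_database
  username: my_user
  password: my_password

storage:                     # where profiling results are written
  connection:
    type: sqlite
    filepath: ./baselinr_metadata.db

profiling:                   # optional; defaults apply otherwise
  tables:
    - table: customers
      schema: public
```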
## Related Documentation
- API Reference - Complete API documentation
- Installation Guide - Installation instructions
- Quick Start Guide - Quick start tutorial
- Drift Detection Guide - Drift detection details
- Python SDK Guide - SDK usage guide
- Best Practices Guide - Configuration best practices
- Troubleshooting Guide - Configuration troubleshooting