Column-Level Configuration Guide
Baselinr supports fine-grained column-level configurations for profiling, drift detection, and anomaly detection. This allows you to control exactly which columns are profiled, which metrics are computed, and how drift/anomaly detection behaves on a per-column basis.
Overview
Column-level configurations are defined in ODCS contracts. All column configs (profiling, drift, validation, anomaly) are nested within each column definition in the contract:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: email
quality:
- type: format
rule: format
specification:
pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
severity: error
- type: not_null
rule: not_null
severity: error
customProperties:
- property: baselinr.anomaly.customers.email
value:
enabled: true
methods: [control_limits, iqr]
- property: baselinr.drift.customers.email
value:
enabled: true
thresholds:
low: 2.0
medium: 5.0
high: 10.0
Key Features
- Column Selection: Choose which columns to profile using explicit names or patterns
- Per-Column Metrics: Specify which metrics to compute for each column
- Per-Column Drift Control: Customize drift thresholds, strategies, and enable/disable per column
- Per-Column Anomaly Control: Configure anomaly detection methods and thresholds per column
- Pattern Matching: Use wildcards (
*_id) or regex patterns for column names - Dependency Management: Automatic handling of dependencies (drift/anomaly require profiling)
Configuration Structure
Column Configuration in ODCS Contracts
All column-level configuration is defined in ODCS contracts within the dataset[].columns[] section:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: email
# Validation rules via quality field
quality:
- type: format
rule: format
specification:
pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
severity: error
# Profiling, drift, and anomaly via customProperties
customProperties:
- property: baselinr.anomaly.customers.email
value:
enabled: true
methods: [control_limits, iqr]
thresholds:
iqr_threshold: 1.5
mad_threshold: 2.0
- property: baselinr.drift.customers.email
value:
enabled: true
strategy: absolute_threshold
thresholds:
low: 2.0
medium: 5.0
high: 10.0
Important: Column-level configurations are defined in ODCS contracts using:
qualityfield for validation rulescustomPropertieswithbaselinr.*prefixes for profiling, drift, and anomaly settings
See ODCS Data Contracts Guide for complete documentation.
Column Selection
Explicit Column Names
Specify exact column names to configure:
columns:
- name: email
metrics: [count, null_count, distinct_count]
- name: age
metrics: [count, mean, stddev, min, max]
Wildcard Patterns
Use wildcard patterns to match multiple columns:
columns:
- name: "*_id" # Matches: customer_id, order_id, product_id
metrics: [count, null_count]
- name: "email*" # Matches: email, email_address, email_verified
metrics: [count, null_count, distinct_count]
Wildcard Syntax:
*matches any sequence of characters?matches a single character
Regex Patterns
For more complex patterns, use regex:
columns:
- name: "^(customer|order|product)_id$"
pattern_type: regex
metrics: [count, null_count]
Excluding Columns
To skip profiling specific columns, set profiling.enabled: false:
columns:
- name: internal_notes
profiling:
enabled: false # Column won't be profiled
- name: "*_temp" # Skip all temporary columns
profiling:
enabled: false
Profiling Configuration
Select Which Columns to Profile
By default, all columns are profiled. When you specify columns in an ODCS contract, only matching columns are profiled.
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: email
- column: age
- column: name
# Only email, age, and name will be profiled
To profile everything except specific columns, use customProperties:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: internal_notes
customProperties:
- property: baselinr.profiling.customers.internal_notes
value:
enabled: false
Custom Metrics Per Column
Override table-level metrics for specific columns using customProperties:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: email
- column: age
- column: metadata_json
customProperties:
- property: baselinr.profiling.customers.email
value:
metrics: [count, null_count, distinct_count]
- property: baselinr.profiling.customers.age
value:
metrics: [count, mean, stddev, min, max, null_ratio]
- property: baselinr.profiling.customers.metadata_json
value:
metrics: [count, null_count]
Available Metrics:
count- Total row countnull_count- Number of null valuesnull_ratio- Ratio of nulls (0.0-1.0)distinct_count- Number of distinct valuesunique_ratio- Ratio of distinct to totalapprox_distinct_count- Approximate distinct countmin- Minimum valuemax- Maximum valuemean- Average (numeric)stddev- Standard deviation (numeric)histogram- Distribution histogram (numeric)data_type_inferred- Inferred semantic typemin_length,max_length,avg_length- String length metrics
Drift Detection Configuration
Per-Column Drift Thresholds
Override global drift thresholds for specific columns using ODCS contracts:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: lifetime_value
customProperties:
- property: baselinr.drift.customers.lifetime_value
value:
enabled: true
thresholds:
low: 5.0 # 5% change = low severity
medium: 10.0 # 10% change = medium severity
high: 20.0 # 20% change = high severity
Disable Drift Detection Per Column
Skip drift detection for specific columns using ODCS contracts:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: internal_notes
- column: customer_id
customProperties:
- property: baselinr.drift.customers.internal_notes
value:
enabled: false # No drift detection for this column
- property: baselinr.drift.customers.customer_id
value:
enabled: false # No drift for ID columns
Per-Column Drift Strategy
Override drift strategy for specific columns using ODCS contracts:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: amount
- column: status
customProperties:
- property: baselinr.drift.customers.amount
value:
strategy: statistical # Use statistical tests for this column
thresholds: {...}
- property: baselinr.drift.customers.status
value:
strategy: absolute_threshold # Use simple thresholds
thresholds:
low: 2.0
medium: 5.0
high: 10.0
Per-Column Baseline Selection
Override baseline selection strategy per column using ODCS contracts:
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: daily_revenue
customProperties:
- property: baselinr.drift.customers.daily_revenue
value:
baselines:
strategy: prior_period # Use prior period for seasonality
windows:
prior_period: 7 # Same day last week
Anomaly Detection Configuration
Per-Column Anomaly Methods
Enable specific anomaly detection methods per column using ODCS contracts:
# contracts/orders.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: public.orders
columns:
- column: amount
- column: order_date
customProperties:
- property: baselinr.anomaly.orders.amount
value:
enabled: true
methods: [control_limits, iqr, mad] # Only these methods
- property: baselinr.anomaly.orders.order_date
value:
enabled: true
methods: [seasonality, regime_shift] # Focus on temporal patterns
Per-Column Anomaly Thresholds
Customize anomaly detection sensitivity per column using ODCS contracts:
# contracts/orders.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: public.orders
columns:
- column: amount
- column: quantity
customProperties:
- property: baselinr.anomaly.orders.amount
value:
enabled: true
thresholds:
iqr_threshold: 2.0 # More sensitive (default: 1.5)
mad_threshold: 3.5 # More sensitive (default: 3.0)
ewma_deviation_threshold: 2.5 # More sensitive (default: 2.0)
- property: baselinr.anomaly.orders.quantity
value:
enabled: true
thresholds:
iqr_threshold: 3.0 # Less sensitive
Disable Anomaly Detection Per Column
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: metadata_json
customProperties:
- property: baselinr.anomaly.customers.metadata_json
value:
enabled: false # Skip anomaly detection
Dependency Management
Understanding Dependencies
There's a critical dependency chain:
Profiling → Drift Detection → Anomaly Detection
- Profiling must run first (produces metrics)
- Drift Detection requires profiling (compares metrics across runs)
- Anomaly Detection requires profiling (analyzes current run's metrics)
Automatic Dependency Handling
Baselinr automatically handles dependencies:
- If a column is not profiled (
profiling.enabled: false), drift and anomaly detection are automatically skipped - If drift/anomaly is configured but profiling is disabled, a warning is issued
- Columns without profiling cannot have drift or anomaly detection enabled
Example: Invalid Configuration (Will Warn)
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: metadata
customProperties:
- property: baselinr.profiling.customers.metadata
value:
enabled: false # ❌ Profiling disabled
- property: baselinr.drift.customers.metadata
value:
enabled: true # ⚠️ Warning: Drift requires profiling
- property: baselinr.anomaly.customers.metadata
value:
enabled: true # ⚠️ Warning: Anomaly requires profiling
Result: Warnings are logged, and drift/anomaly are automatically skipped for this column.
Complete Examples
Example 1: Basic Column Selection
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: email
- column: age
- column: customer_id
customProperties:
- property: baselinr.profiling.customers.email
value:
metrics: [count, null_count, distinct_count]
- property: baselinr.profiling.customers.age
value:
metrics: [count, mean, stddev, min, max]
- property: baselinr.profiling.customers.customer_id
value:
metrics: [count, null_count]
Example 2: Selective Profiling with Drift Control
# contracts/customers.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: customers
physicalName: public.customers
columns:
- column: email
- column: lifetime_value
- column: internal_notes
- column: customer_id
customProperties:
- property: baselinr.profiling.customers.email
value:
metrics: [count, null_count, distinct_count]
- property: baselinr.drift.customers.email
value:
enabled: true
thresholds:
low: 2.0
medium: 5.0
high: 10.0
- property: baselinr.profiling.customers.lifetime_value
value:
metrics: [count, mean, stddev, min, max]
- property: baselinr.drift.customers.lifetime_value
value:
enabled: true
thresholds:
low: 5.0
medium: 15.0
high: 30.0
- property: baselinr.profiling.customers.internal_notes
value:
enabled: false # Not profiled
- property: baselinr.drift.customers.customer_id
value:
enabled: false # No drift detection for IDs
Example 3: Full Configuration with Anomaly Detection
# contracts/orders.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: public.orders
columns:
- column: amount
- column: order_date
- column: notes
customProperties:
- property: baselinr.profiling.orders.amount
value:
metrics: [count, mean, stddev, min, max]
- property: baselinr.drift.orders.amount
value:
enabled: true
strategy: absolute_threshold
thresholds:
low: 5.0
medium: 15.0
high: 30.0
- property: baselinr.anomaly.orders.amount
value:
enabled: true
methods: [control_limits, iqr, mad]
thresholds:
iqr_threshold: 2.0
mad_threshold: 3.5
- property: baselinr.profiling.orders.order_date
value:
metrics: [count, min, max]
- property: baselinr.anomaly.orders.order_date
value:
enabled: true
methods: [seasonality, regime_shift]
- property: baselinr.profiling.orders.notes
value:
enabled: false # Skip entirely
Example 4: Pattern-Based Column Configuration
Note: ODCS contracts don't support pattern-based column matching directly. You need to explicitly list columns or use multiple contracts. For pattern-based matching, consider using the global profiling.tables configuration with patterns.
# config.yml
profiling:
tables:
- table: events
schema: public
# Pattern matching is handled at table level, not column level
For column-specific configurations, define them explicitly in ODCS contracts.
Schema-Level Configuration
Note: Schema-level configurations are now handled via ODCS contracts. You can organize contracts by schema (e.g., contracts/analytics/*.odcs.yaml) or include multiple datasets in a single contract to apply common settings.
Overview
With ODCS contracts, you can define common settings for multiple tables by:
- Creating separate contract files per schema (e.g.,
contracts/analytics/orders.odcs.yaml) - Including multiple datasets in one contract file
- Using customProperties to apply schema-wide policies
Schema Configuration with ODCS Contracts
# contracts/analytics_schema.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: warehouse.analytics.orders
columns:
- column: order_id
- column: total_amount
customProperties:
- property: baselinr.partition.orders
value:
strategy: latest
key: date
- property: baselinr.sampling.orders
value:
enabled: true
fraction: 0.1
- property: baselinr.drift.orders.order_id
value:
enabled: false # All ID columns skip drift
- property: baselinr.drift.orders.total_amount
value:
enabled: true
thresholds:
low: 1.0
How Schema Configs Work
- Schema Matching: Schema configs match based on schema name (and optionally database name)
- Config Merging: Schema configs are merged with table patterns before profiling
- Precedence: Table-level configs override schema-level configs
- Column Configs: Schema column configs are merged with table column configs (table takes precedence)
Example: Schema-Level Column Configs
Apply column configurations to all tables in a schema using ODCS contracts:
# contracts/analytics_schema.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: analytics.orders
columns:
- column: order_id
- column: metadata
- name: customers
physicalName: analytics.customers
columns:
- column: customer_id
- column: email
- column: metadata
customProperties:
# Schema-wide: disable drift for all ID columns
- property: baselinr.drift.orders.order_id
value:
enabled: false
- property: baselinr.drift.customers.customer_id
value:
enabled: false
# Schema-wide: skip metadata columns
- property: baselinr.profiling.orders.metadata
value:
enabled: false
- property: baselinr.profiling.customers.metadata
value:
enabled: false
# Table-specific override: enable drift for email
- property: baselinr.drift.customers.email
value:
enabled: true
Example: Schema-Level Sampling
Apply sampling configuration to all tables in a schema:
# contracts/staging_schema.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: staging_table1
physicalName: staging.table1
- name: staging_table2
physicalName: staging.table2
customProperties:
# Apply 10% sampling to all staging tables
- property: baselinr.sampling.staging_table1
value:
enabled: true
fraction: 0.1
- property: baselinr.sampling.staging_table2
value:
enabled: true
fraction: 0.1
Example: Database-Specific Schema Configs
Use database-specific schema configs for multi-database setups:
profiling:
schemas:
- schema: analytics
database: warehouse_prod
sampling:
enabled: true
fraction: 0.05 # Production: 5% sampling
- schema: analytics
database: warehouse_dev
sampling:
enabled: true
fraction: 0.2 # Development: 20% sampling
Schema Config with Pattern Matching
Schema configs apply to tables discovered via patterns:
profiling:
schemas:
- schema: analytics
columns:
- name: "*_id"
drift:
enabled: false
tables:
- pattern: "user_*" # Pattern matches user_profiles, user_sessions, etc.
schema: analytics # All matched tables inherit schema column configs
Schema Config with select_schema
With ODCS contracts, you can organize contracts by schema directory structure:
# contracts/analytics/orders.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: analytics.orders
columns:
- column: order_date
partitionStatus: true
- column: temp_data
customProperties:
- property: baselinr.partition.orders
value:
strategy: latest
key: order_date
- property: baselinr.profiling.orders.temp_data
value:
enabled: false
Database-Level Configuration
Note: Database-level configurations are now handled via ODCS contracts. Organize contracts by database (e.g., contracts/warehouse/*.odcs.yaml) or use contract-level customProperties to apply database-wide policies.
Overview
With ODCS contracts, you can achieve database-level configuration by:
- Organizing contracts in database-specific directories
- Using contract-level customProperties for database-wide policies
- Including multiple datasets from different schemas in one contract
Database Configuration with ODCS Contracts
# contracts/warehouse_database.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: warehouse.analytics.orders
columns:
- column: order_id
- column: temp_data
- name: customers
physicalName: warehouse.analytics.customers
columns:
- column: customer_id
- column: temp_data
customProperties:
# Database-wide: 5% sampling for all tables
- property: baselinr.sampling.orders
value:
enabled: true
fraction: 0.05
- property: baselinr.sampling.customers
value:
enabled: true
fraction: 0.05
# Database-wide: disable drift for all ID columns
- property: baselinr.drift.orders.order_id
value:
enabled: false
- property: baselinr.drift.customers.customer_id
value:
enabled: false
# Database-wide: skip temp columns
- property: baselinr.profiling.orders.temp_data
value:
enabled: false
- property: baselinr.profiling.customers.temp_data
value:
enabled: false
How Database Configs Work
- Scope: Database configs apply to ALL schemas/tables in the specified database
- Merging: Database configs are merged with schema and table configs
- Precedence: Database → Schema → Table → Column (each level can override previous)
- Column Configs: Database column configs are merged with schema and table column configs (table takes highest precedence)
Example: Database-Level Column Configs
Apply policies at the database level using ODCS contracts:
# contracts/warehouse_database.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: warehouse.analytics.orders
columns:
- column: customer_id
- name: customers
physicalName: warehouse.analytics.customers
columns:
- column: customer_id
customProperties:
# Database-level: disable drift for all ID columns (can be overridden per table)
- property: baselinr.drift.orders.customer_id
value:
enabled: false
# Table-level override: enable drift for customers table
- property: baselinr.drift.customers.customer_id
value:
enabled: true
thresholds:
low: 1.0
Example: Database-Level Sampling
Apply consistent sampling strategy across all tables in a database:
# contracts/staging_db.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: table1
physicalName: staging_db.schema1.table1
- name: table2
physicalName: staging_db.schema2.table2
customProperties:
# All tables in staging_db sample 5%
- property: baselinr.sampling.table1
value:
enabled: true
fraction: 0.05
- property: baselinr.sampling.table2
value:
enabled: true
fraction: 0.05
Example: Multi-Level Precedence
With ODCS contracts, precedence is handled via contract-level and dataset-level customProperties:
# contracts/warehouse_analytics_orders.odcs.yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
- name: orders
physicalName: warehouse.analytics.orders
columns:
- column: created_at
partitionStatus: true
- column: customer_id
customProperties:
# Table-level: range partition strategy (overrides any database/schema defaults)
- property: baselinr.partition.orders
value:
strategy: range
key: created_at
# Table-level: disable drift for customer_id (overrides any database/schema defaults)
- property: baselinr.drift.orders.customer_id
value:
enabled: false
Result:
orderstable usesrangepartition strategy (defined in contract)customer_idcolumn has drift disabled (defined in contract)
When to Use Database-Level Configs
Use database-level configs for:
- Organization-Wide Policies: Apply consistent policies across all tables in a database
- Cost Management: Apply sampling or filtering at the database level
- Security: Disable profiling/drift for sensitive column patterns across the entire database
- Environment-Specific Settings: Different policies for
productionvsstagingdatabases
Database Configs vs Schema Configs
- Database Configs: Apply to ALL schemas/tables in the database (broadest scope)
- Schema Configs: Apply to ALL tables in a specific schema (can be database-specific)
- Table Configs: Apply to a specific table (most specific)
Use database configs for organization-wide policies, and schema configs for schema-specific overrides.
Configuration Precedence
Configurations are merged with the following precedence (highest to lowest):
- Column-level config (most specific - from ODCS contract columns)
- Contract-level config (from ODCS contract customProperties)
- Global config (defaults from
drift_detectionandstoragesections)
Example with Database, Schema, and Table Levels:
# Global defaults
drift_detection:
strategy: absolute_threshold
absolute_threshold:
low_threshold: 5.0
medium_threshold: 15.0
high_threshold: 30.0
profiling:
databases:
- database: warehouse
columns:
- name: "*_id"
drift:
enabled: false # Database-level: disable drift for all IDs
schemas:
- schema: analytics
database: warehouse
columns:
- name: "customer_id"
drift:
enabled: true # Schema-level: override for customer_id
tables:
- table: customers
schema: analytics
database: warehouse
columns:
- name: amount
# Column-level overrides schema, database, and global
drift:
enabled: true
thresholds:
low: 2.0 # Uses 2.0 instead of 5.0
medium: 10.0 # Uses 10.0 instead of 15.0
high: 20.0 # Uses 20.0 instead of 30.0
Backward Compatibility
Column-level configurations are fully backward compatible:
- If
columnsis not specified, all columns are profiled with table-level defaults - Table-level configurations continue to work as before
- Existing configurations without column configs work unchanged
- Column-level features are opt-in
Best Practices
1. Start Broad, Then Narrow
Begin with table-level configurations, then add column-level configs for specific needs:
# Start here
profiling:
tables:
- table: customers
# Then refine specific columns
profiling:
tables:
- table: customers
columns:
- name: critical_field
drift:
thresholds:
low: 1.0 # Very sensitive
2. Use Patterns for Repeated Configurations
Instead of listing every column, use patterns:
columns:
- name: "*_id"
drift:
enabled: false # IDs shouldn't drift
- name: "*_timestamp"
metrics: [count, min, max] # Timestamps don't need histograms
3. Disable Profiling for Large/Unimportant Columns
Save compute resources by skipping large JSON/text columns:
columns:
- name: raw_json_payload
profiling:
enabled: false
- name: "*_metadata"
profiling:
enabled: false
4. Adjust Sensitivity Based on Business Importance
Use tighter thresholds for critical business metrics:
columns:
- name: revenue
drift:
thresholds:
low: 1.0 # Very sensitive
medium: 3.0
high: 5.0
- name: metadata
drift:
thresholds:
low: 10.0 # More lenient
medium: 20.0
high: 40.0
5. Group Related Columns
Use patterns to configure related columns together:
columns:
- name: "*_email*" # email, email_address, email_verified
metrics: [count, null_count, distinct_count]
drift:
thresholds:
low: 2.0
Troubleshooting
Column Not Being Profiled
Problem: Column specified in config isn't being profiled.
Solutions:
- Check column name spelling (case-sensitive in some databases)
- Verify pattern matches (test with explicit name first)
- Ensure
profiling.enabledis notfalse - Check that column exists in the table
Drift Detection Not Running
Problem: Drift configured but not detecting changes.
Solutions:
- Verify column was actually profiled (check
profiled_columnsin metadata) - Check
drift.enabledis notfalse - Ensure at least 2 profiling runs exist for comparison
- Verify thresholds are appropriate for the data
Warnings About Dependencies
Problem: Warnings about drift/anomaly configured but profiling disabled.
Solutions:
- Remove drift/anomaly config if profiling is intentionally disabled
- Enable profiling if you want drift/anomaly detection
- Review configuration for typos or logic errors
See Also
- Drift Detection Guide - Comprehensive drift detection documentation
- Anomaly Detection Guide - Anomaly detection documentation
- Configuration Reference - Complete configuration schema reference