Smart Column Selection and Check Type Recommendations

Overview

Phase 2 of intelligent selection for baselinr introduces column-level intelligence that automatically suggests appropriate data quality checks based on column characteristics. Building on Phase 1's usage-based table selection, it analyzes column metadata and statistical properties to generate specific, actionable check recommendations for each column.

Key Features

  • Automatic Check Inference: Analyzes column names, types, constraints, and statistics to suggest appropriate checks
  • Pattern Recognition: Matches columns against common naming conventions (timestamps, IDs, emails, etc.)
  • Confidence Scoring: Assigns confidence scores to recommendations based on signal strength
  • Pattern Learning: Learns from existing configurations to improve future recommendations
  • CLI Integration: New --columns flag for the recommend command

New Modules

Column Analysis Module (baselinr/smart_selection/column_analysis/)

  • metadata_analyzer.py: Extracts column metadata signals (names, types, constraints, keys) using SQLAlchemy inspection
  • statistical_analyzer.py: Analyzes historical profiling data for cardinality, distributions, and patterns
  • pattern_matcher.py: Matches column names against predefined and custom naming-convention patterns
  • check_inferencer.py: Maps column characteristics to appropriate check types using rule-based logic
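
Conceptually, the pattern matcher is a glob-based lookup from column names to candidate checks. The sketch below is illustrative only; the pattern table and function name are assumptions, not the actual pattern_matcher.py API:

from fnmatch import fnmatch

# Hypothetical built-in patterns, mirroring the conventions listed in this doc.
DEFAULT_PATTERNS = {
    "*_at": ["freshness", "completeness"],
    "*_id": ["uniqueness", "completeness"],
    "*email*": ["format_email"],
    "*_url": ["format_url"],
}

def match_column(column_name, patterns=DEFAULT_PATTERNS):
    """Return the check types suggested by every pattern the name matches."""
    checks = []
    for pattern, suggested in patterns.items():
        if fnmatch(column_name.lower(), pattern):
            checks.extend(suggested)
    return checks

print(match_column("created_at"))  # ['freshness', 'completeness']
print(match_column("user_email"))  # ['format_email']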

Scoring Module (baselinr/smart_selection/scoring/)

  • confidence_scorer.py: Calculates confidence scores for recommendations based on signal strength
  • check_prioritizer.py: Ranks and filters checks to avoid over-monitoring

Learning Module (baselinr/smart_selection/learning/)

  • pattern_learner.py: Learns naming patterns and check preferences from existing configurations
  • pattern_store.py: Persists and manages learned patterns in YAML format

Supported Check Types

  • completeness: Primary keys, non-nullable columns, required fields
  • uniqueness: Primary keys, columns with high cardinality
  • freshness: Timestamp columns (*_at, *_date, updated_*, created_*)
  • format_email: Email columns (*email*, *_email)
  • format_phone: Phone columns (*phone*, *_phone)
  • format_url: URL columns (*url*, *_url, *_link)
  • format_uuid: UUID columns (*uuid*, *_guid)
  • non_negative: Monetary/count columns (*amount*, *price*, *count*)
  • range: Numeric columns with observed min/max values
  • allowed_values: Low-cardinality categorical columns
  • referential_integrity: Foreign key columns
  • valid_json: JSON/JSONB columns
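
The inference rules can be thought of as a cascade of if/then mappings over column metadata and profiling stats. A condensed, illustrative sketch follows; the ColumnInfo shape and the exact rules are assumptions, not the real check_inferencer.py:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnInfo:
    name: str
    data_type: str                          # e.g. "varchar", "timestamp", "json"
    is_primary_key: bool = False
    is_foreign_key: bool = False
    nullable: bool = True
    distinct_ratio: Optional[float] = None  # from profiling data, if available

def infer_checks(col):
    checks = []
    if col.is_primary_key or not col.nullable:
        checks.append("completeness")
    if col.is_primary_key or (col.distinct_ratio or 0.0) > 0.95:
        checks.append("uniqueness")
    if col.is_foreign_key:
        checks.append("referential_integrity")
    if col.data_type == "timestamp" or col.name.endswith(("_at", "_date")):
        checks.append("freshness")
    if col.data_type in ("json", "jsonb"):
        checks.append("valid_json")
    return checks

print(infer_checks(ColumnInfo("event_id", "varchar", is_primary_key=True)))
# ['completeness', 'uniqueness']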

Configuration

Column Selection Settings

smart_selection:
  enabled: true

  # Table selection (Phase 1)
  tables:
    mode: "recommend"
    # ... existing table selection config

  # Column selection (Phase 2 - NEW)
  columns:
    enabled: true
    mode: "recommend"  # Options: recommend | auto | disabled

    inference:
      use_profiling_data: true   # Use existing profile stats
      confidence_threshold: 0.7  # Minimum confidence to recommend
      max_checks_per_column: 3   # Avoid over-monitoring

    # Column prioritization
    prioritize:
      primary_keys: true
      foreign_keys: true
      timestamp_columns: true
      high_cardinality_strings: false

    # Check type preferences
    preferred_checks:
      - completeness
      - freshness
      - uniqueness

    avoided_checks:
      - custom_sql

    # Custom pattern overrides
    patterns:
      - match: "*_email"
        checks:
          - type: format_email
            confidence: 0.95

      - match: "revenue_*"
        checks:
          - type: non_negative
            confidence: 0.9
          - type: distribution
            confidence: 0.8

    # Pattern learning settings
    learning:
      enabled: true
      min_occurrences: 2
      storage_path: ".baselinr_patterns.yaml"

CLI Usage

Generate Column Recommendations

# Recommend checks for all columns in recommended/configured tables
baselinr recommend --columns --config config.yaml

# Recommend for a specific table
baselinr recommend --columns --table analytics.user_events --config config.yaml

# Show detailed reasoning for recommendations
baselinr recommend --columns --explain --config config.yaml

# Preview changes without applying
baselinr recommend --columns --dry-run --config config.yaml

# Apply column recommendations to config
baselinr recommend --columns --apply --config config.yaml

Example Output

Analyzing 15 recommended tables...
Analyzing columns in analytics.user_events (45 columns)...
Analyzing columns in analytics.transactions (32 columns)...

Generated 247 column check recommendations across 15 tables
- High confidence: 156 (63%)
- Medium confidence: 71 (29%)
- Low confidence: 20 (8%)

Output saved to: recommendations.yaml

Review recommendations with: baselinr recommend --columns --explain
Apply recommendations with: baselinr recommend --columns --apply

Explain Mode Output

Table: analytics.user_events
45 columns analyzed, 23 checks recommended

HIGH CONFIDENCE RECOMMENDATIONS:
✓ event_id (varchar)
  → uniqueness check (confidence: 0.98)
    Reason: Primary key pattern, 100% distinct values
  → completeness check (confidence: 0.95)
    Reason: Critical identifier field

✓ event_timestamp (timestamp)
  → freshness check (confidence: 0.98)
    Reason: Timestamp column, updated continuously
  → completeness check (confidence: 0.95)
    Reason: Required temporal marker

MEDIUM CONFIDENCE RECOMMENDATIONS:
◐ user_email (varchar)
  → format_email check (confidence: 0.72)
    Reason: Email pattern match in column name

Recommendation Output Format

When saved to recommendations.yaml:

recommended_tables:
  - schema: analytics
    table: user_events
    confidence: 0.95
    reasons:
      - "Queried 1,247 times in last 30 days"

    column_recommendations:
      - column: event_id
        data_type: varchar
        confidence: 0.95
        signals:
          - "Column name matches pattern: *_id"
          - "Primary key indicator"
          - "100% unique values"
        suggested_checks:
          - type: uniqueness
            confidence: 0.98
            config:
              threshold: 1.0
          - type: completeness
            confidence: 0.95
            config:
              min_completeness: 1.0

      - column: user_email
        data_type: varchar
        confidence: 0.85
        signals:
          - "Email pattern match"
        suggested_checks:
          - type: format_email
            confidence: 0.92
            config:
              pattern: "email"

low_confidence_suggestions:
  - schema: analytics
    table: user_events
    column: metadata_json
    data_type: json
    confidence: 0.45
    signals:
      - "JSON column detected"
    suggested_checks:
      - type: valid_json
        confidence: 0.60
    note: "Consider manual inspection to define schema validation"

Pattern Learning

The system can learn from your existing configurations to improve recommendations:

# Learned patterns stored in .baselinr_patterns.yaml
learned_patterns:
- pattern: "*_at"
pattern_type: suffix
suggested_checks:
- freshness
- completeness
confidence: 0.95
occurrence_count: 15
source_columns:
- created_at
- updated_at
- deleted_at

- pattern: "is_*"
pattern_type: prefix
suggested_checks:
- completeness
confidence: 0.88
occurrence_count: 8
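
At its core, learning is frequency counting over (column name, check type) pairs already present in your configuration. A simplified sketch, assuming suffix patterns only (not the real pattern_learner.py):

from collections import Counter

def learn_suffix_patterns(configured, min_occurrences=2):
    """configured: (column_name, check_type) pairs taken from existing configs."""
    counts = Counter()
    for name, check in configured:
        if "_" in name:
            suffix = "*_" + name.rsplit("_", 1)[1]  # e.g. created_at -> *_at
            counts[(suffix, check)] += 1
    # Keep only patterns seen at least min_occurrences times.
    return {key: n for key, n in counts.items() if n >= min_occurrences}

existing = [("created_at", "freshness"), ("updated_at", "freshness"),
            ("deleted_at", "freshness"), ("order_id", "uniqueness")]
print(learn_suffix_patterns(existing))  # {('*_at', 'freshness'): 3}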

Confidence Scoring

Confidence scores are calculated as a weighted sum of the following factors, sketched in code after the list:

  • Metadata signals (weight 30%): Column name, type, constraints
  • Statistical signals (weight 30%): Cardinality, null rate, distributions
  • Pattern matches (weight 25%): Naming convention matches
  • Consistency (weight 15%): Historical stability of values
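
A minimal sketch with the weights above, assuming each per-factor score is normalized to [0, 1]:

# Factor weights from the list above; per-factor scoring is assumed
# to be normalized to the [0, 1] range.
WEIGHTS = {"metadata": 0.30, "statistical": 0.30, "pattern": 0.25, "consistency": 0.15}

def confidence(scores):
    return sum(WEIGHTS[factor] * scores.get(factor, 0.0) for factor in WEIGHTS)

# A primary-key-like column: strong metadata and statistics, clear pattern hit.
print(confidence({"metadata": 1.0, "statistical": 1.0,
                  "pattern": 0.9, "consistency": 0.8}))  # ~0.945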

Confidence Levels:

  • High (0.8-1.0): Strong signals, safe to auto-apply
  • Medium (0.5-0.8): Reasonable signals, user validation recommended
  • Low (0.3-0.5): Weak signals, manual review required

Check Prioritization

To avoid over-monitoring, checks are prioritized by the following criteria (a code sketch follows the limits below):

  1. Column Importance: Primary keys > Foreign keys > Timestamps > Regular columns
  2. Check Value: Uniqueness/Completeness > Freshness > Format validation > Distribution
  3. Confidence Score: Higher confidence checks ranked first
  4. User Preferences: Preferred checks boosted, avoided checks filtered

Default limits:

  • Maximum 5 checks per column
  • Maximum 50 checks per table
  • Minimum 0.5 confidence threshold
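
That ordering reduces to a sort key plus caps. A sketch under assumed rank tables; the real prioritizer's weights and shapes may differ:

# Illustrative ranks; lower is more important. These values are assumptions.
COLUMN_RANK = {"primary_key": 0, "foreign_key": 1, "timestamp": 2, "regular": 3}
CHECK_RANK = {"uniqueness": 0, "completeness": 0, "freshness": 1,
              "format_email": 2, "format_url": 2, "distribution": 3}

def prioritize(checks, max_per_column=5, min_confidence=0.5):
    """checks: dicts with 'column', 'column_kind', 'type', 'confidence' keys."""
    kept = [c for c in checks if c["confidence"] >= min_confidence]
    kept.sort(key=lambda c: (COLUMN_RANK.get(c["column_kind"], 3),
                             CHECK_RANK.get(c["type"], 3),
                             -c["confidence"]))
    per_column, result = {}, []
    for c in kept:
        if per_column.get(c["column"], 0) < max_per_column:
            result.append(c)
            per_column[c["column"]] = per_column.get(c["column"], 0) + 1
    return result

checks = [
    {"column": "event_id", "column_kind": "primary_key", "type": "uniqueness", "confidence": 0.98},
    {"column": "payload", "column_kind": "regular", "type": "valid_json", "confidence": 0.40},
]
print([c["type"] for c in prioritize(checks)])  # ['uniqueness']; 0.40 is filtered out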

Integration with Existing Config

Column recommendations integrate with existing configurations without overriding them:

  • Explicit configs take precedence: User-defined column checks are never overwritten
  • Conflict warnings: Alerts when recommendations conflict with existing checks
  • Exclusion support: Columns can be excluded via exclude_from_recommendations (see the sketch after this list)
  • Partial acceptance: Accept some recommendations, reject others
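
For example, an exclusion list might look like the following; the exact key placement and entry format are assumptions, so check your version's config reference:

# Assumed placement; verify against your version's config schema.
smart_selection:
  columns:
    exclude_from_recommendations:
      - analytics.user_events.metadata_json
      - "*.raw_payload"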

Tests

The implementation includes comprehensive unit tests:

  • test_column_analysis.py: 28 tests covering metadata, patterns, statistics, and inference
  • test_column_scoring.py: 15 tests covering confidence scoring and prioritization
  • test_column_learning.py: 20 tests covering pattern learning and storage
  • test_column_recommendation_integration.py: 13 tests covering end-to-end flows

Run tests:

pytest tests/test_column_analysis.py tests/test_column_scoring.py \
  tests/test_column_learning.py tests/test_column_recommendation_integration.py -v

Architecture

baselinr/smart_selection/
├── column_analysis/
│   ├── __init__.py
│   ├── metadata_analyzer.py      # SQLAlchemy-based metadata extraction
│   ├── statistical_analyzer.py   # Profiling data analysis
│   ├── pattern_matcher.py        # Naming convention matching
│   └── check_inferencer.py       # Check type inference rules
├── scoring/
│   ├── __init__.py
│   ├── confidence_scorer.py      # Confidence calculation
│   └── check_prioritizer.py      # Check ranking and filtering
├── learning/
│   ├── __init__.py
│   ├── pattern_learner.py        # Learn from existing configs
│   └── pattern_store.py          # Persist learned patterns
├── config.py                     # Extended with column settings
├── recommender.py                # Integrated column recommendations
└── __init__.py                   # Updated exports

Future Enhancements

Potential future improvements:

  • Industry profiles: Pre-built patterns for finance, healthcare, e-commerce
  • Composite checks: Cross-column validations (e.g., start_date < end_date)
  • Severity levels: Recommend error vs. warning severity
  • ML-based inference: Learn check patterns from accepted/rejected recommendations
  • External rule libraries: Integration with community-contributed rule sets

Migration from Phase 1

No migration required. Phase 2 is additive:

  • All Phase 1 table recommendations continue to work unchanged
  • Column recommendations are opt-in via --columns flag
  • Existing configurations are not modified unless --apply is used
  • The columns section in smart_selection config is optional

Troubleshooting

No column recommendations generated

  1. Ensure smart_selection.columns.enabled: true in config
  2. Check that tables exist and are accessible
  3. Verify confidence threshold isn't too high

Low confidence scores

  1. Run profiling first to generate statistics: baselinr run --config config.yaml
  2. Lower confidence_threshold in config
  3. Add custom patterns for domain-specific columns

Pattern learning not working

  1. Ensure learning.enabled: true in config
  2. Check that .baselinr_patterns.yaml is writable
  3. Verify min_occurrences threshold is met
