Prometheus Metrics Integration
Baselinr now includes Prometheus metrics export for comprehensive monitoring and observability.
Features
The metrics system provides:
- Profiling metrics: Run counts, duration histograms, row/column counts
- Drift detection metrics: Drift event counts, detection duration
- Schema change metrics: Schema modification tracking
- Query metrics: Warehouse query execution times
- Error metrics: Error tracking by warehouse and component
- Worker metrics: Active worker gauge for concurrency monitoring
Installation
Install the Prometheus client library:
pip install "prometheus_client>=0.19.0"
Or install from requirements.txt:
pip install -r requirements.txt
Configuration
Add the monitoring section to your Baselinr config file:
monitoring:
  enable_metrics: true   # Enable Prometheus metrics
  port: 9753             # Port for metrics HTTP server (default: 9753)
  keep_alive: true       # Keep server running after profiling completes (default: true)
Example: examples/config_with_metrics.yml
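These settings control the embedded Prometheus HTTP endpoint. The sketch below shows roughly how they could map onto the Prometheus client; the maybe_start_metrics_server helper is hypothetical and not Baselinr's actual code, only prometheus_client.start_http_server is the real library call.

# Sketch only: how enable_metrics/port could drive the metrics endpoint.
from prometheus_client import start_http_server

def maybe_start_metrics_server(monitoring: dict) -> None:
    """Start the /metrics HTTP endpoint if enabled (hypothetical helper)."""
    if monitoring.get("enable_metrics", False):
        # Serves http://localhost:<port>/metrics from a background thread.
        start_http_server(monitoring.get("port", 9753))

maybe_start_metrics_server({"enable_metrics": True, "port": 9753})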
Available Metrics
Profiling Metrics
baselinr_profile_runs_total (Counter)
- Total number of profiling runs
- Labels: warehouse, table, status (success/failed)
baselinr_profile_duration_seconds (Histogram)
- Histogram of profile execution times in seconds
- Labels: warehouse, table
- Buckets: 0.1s, 0.5s, 1s, 2.5s, 5s, 10s, 30s, 60s, 120s, 300s
baselinr_rows_profiled_total (Counter)
- Total number of rows profiled
- Labels: warehouse, table
baselinr_columns_profiled_total (Counter)
- Total number of columns profiled
- Labels: warehouse, table
baselinr_active_workers (Gauge)
- Number of currently running worker threads
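For reference, the declarations below show how metrics like these are typically defined with prometheus_client. Names, label sets, and buckets follow the list above, but this is an illustrative sketch rather than Baselinr's source.

from prometheus_client import Counter, Gauge, Histogram

PROFILE_RUNS = Counter(
    "baselinr_profile_runs_total",
    "Total number of profiling runs",
    ["warehouse", "table", "status"],
)
PROFILE_DURATION = Histogram(
    "baselinr_profile_duration_seconds",
    "Histogram of profile execution times in seconds",
    ["warehouse", "table"],
    buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300],
)
ROWS_PROFILED = Counter(  # baselinr_columns_profiled_total is analogous
    "baselinr_rows_profiled_total",
    "Total number of rows profiled",
    ["warehouse", "table"],
)
ACTIVE_WORKERS = Gauge(
    "baselinr_active_workers",
    "Number of currently running worker threads",
)

# Example updates for one successful run:
PROFILE_RUNS.labels(warehouse="postgres", table="public.customers", status="success").inc()
ROWS_PROFILED.labels(warehouse="postgres", table="public.customers").inc(10_000)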
Drift Detection Metrics
baselinr_drift_events_total (Counter)
- Total number of drift detection events
- Labels: warehouse, table, metric, severity (low/medium/high)
baselinr_drift_detection_duration_seconds (Histogram)
- Histogram of drift detection execution times in seconds
- Labels: warehouse, table
- Buckets: 0.1s, 0.5s, 1s, 2.5s, 5s, 10s, 30s
baselinr_schema_changes_total (Counter)
- Total number of schema change events
- Labels: warehouse, table, change_type (column_added/column_removed/type_changed)
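Recording drift results with the same client is straightforward. The sketch below is a hypothetical call site, not Baselinr's drift detector; it times one detection pass and counts its findings using the metrics listed above.

from prometheus_client import Counter, Histogram

DRIFT_EVENTS = Counter(
    "baselinr_drift_events_total",
    "Total number of drift detection events",
    ["warehouse", "table", "metric", "severity"],
)
SCHEMA_CHANGES = Counter(
    "baselinr_schema_changes_total",
    "Total number of schema change events",
    ["warehouse", "table", "change_type"],
)
DRIFT_DURATION = Histogram(
    "baselinr_drift_detection_duration_seconds",
    "Histogram of drift detection execution times in seconds",
    ["warehouse", "table"],
    buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30],
)

# Time one detection pass and record what it found.
with DRIFT_DURATION.labels(warehouse="postgres", table="public.customers").time():
    DRIFT_EVENTS.labels(
        warehouse="postgres", table="public.customers", metric="row_count", severity="high"
    ).inc()
    SCHEMA_CHANGES.labels(
        warehouse="postgres", table="public.customers", change_type="column_added"
    ).inc()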
Query and Error Metrics
baselinr_query_duration_seconds (Histogram)
- Histogram of warehouse query execution times in seconds
- Labels: warehouse
- Buckets: 0.01s, 0.05s, 0.1s, 0.5s, 1s, 2.5s, 5s, 10s
baselinr_errors_total (Counter)
- Total number of errors
- Labels: warehouse, error_type, component (profiler/drift_detector/connector)
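A connector-level wrapper is one plausible way these two metrics get populated. In the sketch below, run_query is a hypothetical helper rather than Baselinr's API; only the prometheus_client calls are real library usage.

from prometheus_client import Counter, Histogram

QUERY_DURATION = Histogram(
    "baselinr_query_duration_seconds",
    "Histogram of warehouse query execution times in seconds",
    ["warehouse"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10],
)
ERRORS = Counter(
    "baselinr_errors_total",
    "Total number of errors",
    ["warehouse", "error_type", "component"],
)

def run_query(execute, sql: str, warehouse: str):
    """Execute `sql`, recording its duration and any raised error (sketch)."""
    try:
        with QUERY_DURATION.labels(warehouse=warehouse).time():
            return execute(sql)
    except Exception as exc:
        ERRORS.labels(
            warehouse=warehouse, error_type=type(exc).__name__, component="connector"
        ).inc()
        raise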
Usage
CLI Usage
Run profiling with metrics enabled:
baselinr profile --config config_with_metrics.yml
The metrics server will start automatically and remain running after profiling completes (by default). Metrics will be available at:
http://localhost:9753/metrics
Press Ctrl+C to stop the metrics server and exit.
Note: If you want the CLI to exit immediately after profiling (e.g., in CI/CD), set keep_alive: false in your config:
monitoring:
  enable_metrics: true
  port: 9753
  keep_alive: false  # Exit immediately after profiling
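The flag exists because Prometheus is pull-based: the process has to stay alive long enough to be scraped. The sketch below illustrates the behaviour; run_with_metrics and profile_once are hypothetical names, and only start_http_server is the real library call.

import time
from prometheus_client import start_http_server

def run_with_metrics(profile_once, keep_alive: bool = True, port: int = 9753):
    start_http_server(port)   # expose /metrics on a background thread
    profile_once()            # do the profiling work
    if not keep_alive:
        return                # keep_alive: false -- exit immediately (e.g. CI/CD)
    try:
        while True:           # keep_alive: true -- block so Prometheus can keep scraping
            time.sleep(1)
    except KeyboardInterrupt:
        pass                  # Ctrl+C stops the metrics server and exits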
Dagster Usage
When running in Dagster, metrics are automatically collected if enabled in your configuration. The metrics server starts when the Dagster daemon initializes.
Accessing Metrics
Visit the metrics endpoint to see live metrics:
curl http://localhost:9753/metrics
Example output:
# HELP baselinr_profile_runs_total Total number of profiling runs
# TYPE baselinr_profile_runs_total counter
baselinr_profile_runs_total{warehouse="postgres",table="public.customers",status="success"} 3.0
# HELP baselinr_profile_duration_seconds Histogram of profile execution times in seconds
# TYPE baselinr_profile_duration_seconds histogram
baselinr_profile_duration_seconds_bucket{le="0.5",warehouse="postgres",table="public.customers"} 3.0
baselinr_profile_duration_seconds_sum{warehouse="postgres",table="public.customers"} 0.42
baselinr_profile_duration_seconds_count{warehouse="postgres",table="public.customers"} 3.0
# HELP baselinr_drift_events_total Total number of drift detection events
# TYPE baselinr_drift_events_total counter
baselinr_drift_events_total{warehouse="postgres",table="customers",metric="row_count",severity="high"} 2.0
# HELP baselinr_active_workers Number of currently running worker threads
# TYPE baselinr_active_workers gauge
baselinr_active_workers 0.0
Integration with Prometheus
Prometheus Configuration
Add Baselinr as a scrape target in your prometheus.yml:
scrape_configs:
  - job_name: 'baselinr'
    static_configs:
      - targets: ['localhost:9753']
    scrape_interval: 15s
Grafana Dashboard
Create custom dashboards to visualize (example queries are sketched after this list):
- Profiling throughput and latency
- Drift detection patterns over time
- Error rates by warehouse
- Active worker concurrency
- Schema change frequency
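As a starting point, the PromQL expressions below cover those panels. The snippet wraps them in a query against Prometheus's standard HTTP API; it assumes Prometheus is reachable at localhost:9090 and that the requests package is installed, and the expressions can equally be pasted directly into Grafana.

import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus instance
QUERIES = {
    "profiling_throughput": "sum(rate(baselinr_profile_runs_total[5m]))",
    "profile_latency_p95": "histogram_quantile(0.95, sum(rate(baselinr_profile_duration_seconds_bucket[5m])) by (le))",
    "drift_events_per_hour": "sum(increase(baselinr_drift_events_total[1h])) by (severity)",
    "error_rate_by_warehouse": "sum(rate(baselinr_errors_total[5m])) by (warehouse)",
    "active_workers": "baselinr_active_workers",
    "schema_changes_per_hour": "sum(increase(baselinr_schema_changes_total[1h]))",
}

for panel, expr in QUERIES.items():
    # /api/v1/query returns the current value of each expression.
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
    print(panel, resp.json()["data"]["result"])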
Architecture
The metrics system is integrated at multiple layers:
- CLI Layer: Starts metrics server, creates run context with metrics flag
- Profiling Engine: Records profile start/complete/failed events with timing
- Drift Detector: Records drift events, schema changes, and detection timing
- Warehouse Connectors: Records query execution times and errors
- Run Context: Propagates metrics_enabled flag through the pipeline
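The pattern at each layer is the same: a metric update guarded by the run context's metrics_enabled flag, so instrumentation becomes a no-op when metrics are disabled. A minimal sketch of that idea follows; RunContext and record_profile_run here are illustrative, not Baselinr's actual classes.

from dataclasses import dataclass
from prometheus_client import Counter

PROFILE_RUNS = Counter(
    "baselinr_profile_runs_total",
    "Total number of profiling runs",
    ["warehouse", "table", "status"],
)

@dataclass
class RunContext:
    warehouse: str
    metrics_enabled: bool = False

    def record_profile_run(self, table: str, status: str) -> None:
        if not self.metrics_enabled:
            return  # metrics disabled: recording is a no-op
        PROFILE_RUNS.labels(warehouse=self.warehouse, table=table, status=status).inc()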
Troubleshooting
Metrics Server Won't Start
Error: Failed to start metrics server on port 9753: Address already in use
Solution: Another process is using port 9753. Change the port in your config:
monitoring:
  enable_metrics: true
  port: 9754  # Use a different port
No Metrics Appearing
Check:
- Is enable_metrics: true set in your config?
- Is prometheus_client installed?
- Is the metrics server running? Check the logs for "Prometheus metrics server started".
- Can you access http://localhost:9753/metrics in your browser?
Import Error
Error: ModuleNotFoundError: No module named 'prometheus_client'
Solution: Install the Prometheus client:
pip install "prometheus_client>=0.19.0"
Best Practices
- Enable in Production: Metrics are lightweight and provide valuable observability
- Use Appropriate Port: Ensure port 9753 (or your custom port) is accessible
- Monitor Latency: Watch the baselinr_profile_duration_seconds histograms for performance issues
- Alert on Errors: Set up alerts on baselinr_errors_total
- Track Drift Trends: Use drift metrics to identify data quality issues early
Performance Impact
The metrics system has minimal overhead:
- Metric updates: ~0.001ms per operation
- HTTP server: Single background thread
- Memory: ~10-50MB for metric storage
- Network: ~1-5KB per scrape
Future Enhancements
Planned improvements:
- Dagster-specific metrics (asset materializations, job success rates)
- Custom metric exporters (StatsD, CloudWatch, Datadog)
- Metric aggregation by environment
- SLA tracking and alerting
- Cost metrics (query costs, storage usage)
For questions or issues, please file a GitHub issue or contact the Baselinr team.