Prometheus Metrics Integration
Baselinr now includes Prometheus metrics export for comprehensive monitoring and observability.
Features
The metrics system provides:
- Profiling metrics: Run counts, duration histograms, row/column counts
- Drift detection metrics: Drift event counts, detection duration
- Schema change metrics: Schema modification tracking
- Query metrics: Warehouse query execution times
- Error metrics: Error tracking by warehouse and component
- Worker metrics: Active worker gauge for concurrency monitoring
Installation
Install the Prometheus client library:
pip install "prometheus_client>=0.19.0"
Or install from requirements.txt:
pip install -r requirements.txt
Configuration
Add the monitoring section to your Baselinr config file:
monitoring:
  enable_metrics: true   # Enable Prometheus metrics
  port: 9753             # Port for metrics HTTP server (default: 9753)
  keep_alive: true       # Keep server running after profiling completes (default: true)
Example: examples/config_with_metrics.yml
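These settings control the embedded Prometheus HTTP endpoint. The sketch below shows roughly how they could map onto the Prometheus client; the maybe_start_metrics_server helper is hypothetical and not Baselinr's actual code, only prometheus_client.start_http_server is the real library call.

# Sketch only: how enable_metrics/port could drive the metrics endpoint.
from prometheus_client import start_http_server

def maybe_start_metrics_server(monitoring: dict) -> None:
    """Start the /metrics HTTP endpoint if enabled (hypothetical helper)."""
    if monitoring.get("enable_metrics", False):
        # Serves http://localhost:<port>/metrics from a background thread.
        start_http_server(monitoring.get("port", 9753))

maybe_start_metrics_server({"enable_metrics": True, "port": 9753})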
Available Metrics
Profiling Metrics
baselinr_profile_runs_total (Counter)
- Total number of profiling runs
- Labels: warehouse, table, status (success/failed)
baselinr_profile_duration_seconds (Histogram)
- Histogram of profile execution times in seconds
- Labels: warehouse, table
- Buckets: 0.1s, 0.5s, 1s, 2.5s, 5s, 10s, 30s, 60s, 120s, 300s
baselinr_rows_profiled_total (Counter)
- Total number of rows profiled
- Labels: warehouse, table
baselinr_columns_profiled_total (Counter)
- Total number of columns profiled
- Labels: warehouse, table
baselinr_active_workers (Gauge)
- Number of currently running worker threads
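For reference, the declarations below show how metrics like these are typically defined with prometheus_client. Names, label sets, and buckets follow the list above, but this is an illustrative sketch rather than Baselinr's source.

from prometheus_client import Counter, Gauge, Histogram

PROFILE_RUNS = Counter(
    "baselinr_profile_runs_total",
    "Total number of profiling runs",
    ["warehouse", "table", "status"],
)
PROFILE_DURATION = Histogram(
    "baselinr_profile_duration_seconds",
    "Histogram of profile execution times in seconds",
    ["warehouse", "table"],
    buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300],
)
ROWS_PROFILED = Counter(  # baselinr_columns_profiled_total is analogous
    "baselinr_rows_profiled_total",
    "Total number of rows profiled",
    ["warehouse", "table"],
)
ACTIVE_WORKERS = Gauge(
    "baselinr_active_workers",
    "Number of currently running worker threads",
)

# Example updates for one successful run:
PROFILE_RUNS.labels(warehouse="postgres", table="public.customers", status="success").inc()
ROWS_PROFILED.labels(warehouse="postgres", table="public.customers").inc(10_000)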
Drift Detection Metrics
baselinr_drift_events_total (Counter)
- Total number of drift detection events
- Labels: warehouse, table, metric, severity (low/medium/high)
baselinr_drift_detection_duration_seconds (Histogram)
- Histogram of drift detection execution times in seconds
- Labels: warehouse, table
- Buckets: 0.1s, 0.5s, 1s, 2.5s, 5s, 10s, 30s
baselinr_schema_changes_total (Counter)
- Total number of schema change events
- Labels: warehouse, table, change_type (column_added/column_removed/type_changed)
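Recording drift results with the same client is straightforward. The sketch below is a hypothetical call site, not Baselinr's drift detector; it times one detection pass and counts its findings using the metrics listed above.

from prometheus_client import Counter, Histogram

DRIFT_EVENTS = Counter(
    "baselinr_drift_events_total",
    "Total number of drift detection events",
    ["warehouse", "table", "metric", "severity"],
)
SCHEMA_CHANGES = Counter(
    "baselinr_schema_changes_total",
    "Total number of schema change events",
    ["warehouse", "table", "change_type"],
)
DRIFT_DURATION = Histogram(
    "baselinr_drift_detection_duration_seconds",
    "Histogram of drift detection execution times in seconds",
    ["warehouse", "table"],
    buckets=[0.1, 0.5, 1, 2.5, 5, 10, 30],
)

# Time one detection pass and record what it found.
with DRIFT_DURATION.labels(warehouse="postgres", table="public.customers").time():
    DRIFT_EVENTS.labels(
        warehouse="postgres", table="public.customers", metric="row_count", severity="high"
    ).inc()
    SCHEMA_CHANGES.labels(
        warehouse="postgres", table="public.customers", change_type="column_added"
    ).inc()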
Query and Error Metrics
baselinr_query_duration_seconds (Histogram)
- Histogram of warehouse query execution times in seconds
- Labels: warehouse
- Buckets: 0.01s, 0.05s, 0.1s, 0.5s, 1s, 2.5s, 5s, 10s
baselinr_errors_total (Counter)
- Total number of errors
- Labels: warehouse, error_type, component (profiler/drift_detector/connector)
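A connector-level wrapper is one plausible way these two metrics get populated. In the sketch below, run_query is a hypothetical helper rather than Baselinr's API; only the prometheus_client calls are real library usage.

from prometheus_client import Counter, Histogram

QUERY_DURATION = Histogram(
    "baselinr_query_duration_seconds",
    "Histogram of warehouse query execution times in seconds",
    ["warehouse"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10],
)
ERRORS = Counter(
    "baselinr_errors_total",
    "Total number of errors",
    ["warehouse", "error_type", "component"],
)

def run_query(execute, sql: str, warehouse: str):
    """Execute `sql`, recording its duration and any raised error (sketch)."""
    try:
        with QUERY_DURATION.labels(warehouse=warehouse).time():
            return execute(sql)
    except Exception as exc:
        ERRORS.labels(
            warehouse=warehouse, error_type=type(exc).__name__, component="connector"
        ).inc()
        raise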
Usage
CLI Usage
Run profiling with metrics enabled:
baselinr profile --config config_with_metrics.yml
The metrics server will start automatically and remain running after profiling completes (by default). Metrics will be available at:
http://localhost:9753/metrics
Press Ctrl+C to stop the metrics server and exit.
Note: If you want the CLI to exit immediately after profiling (e.g., in CI/CD), set keep_alive: false in your config:
monitoring:
  enable_metrics: true
  port: 9753
  keep_alive: false  # Exit immediately after profiling
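The flag exists because Prometheus is pull-based: the process has to stay alive long enough to be scraped. The sketch below illustrates the behaviour; run_with_metrics and profile_once are hypothetical names, and only start_http_server is the real library call.

import time
from prometheus_client import start_http_server

def run_with_metrics(profile_once, keep_alive: bool = True, port: int = 9753):
    start_http_server(port)   # expose /metrics on a background thread
    profile_once()            # do the profiling work
    if not keep_alive:
        return                # keep_alive: false -- exit immediately (e.g. CI/CD)
    try:
        while True:           # keep_alive: true -- block so Prometheus can keep scraping
            time.sleep(1)
    except KeyboardInterrupt:
        pass                  # Ctrl+C stops the metrics server and exits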
Dagster Usage
When running in Dagster, metrics are automatically collected if enabled in your configuration. The metrics server starts when the Dagster daemon initializes.
Accessing Metrics
Visit the metrics endpoint to see live metrics:
curl http://localhost:9753/metrics
Example output:
# HELP baselinr_profile_runs_total Total number of profiling runs
# TYPE baselinr_profile_runs_total counter
baselinr_profile_runs_total{warehouse="postgres",table="public.customers",status="success"} 3.0
# HELP baselinr_profile_duration_seconds Histogram of profile execution times in seconds
# TYPE baselinr_profile_duration_seconds histogram
baselinr_profile_duration_seconds_bucket{le="0.5",warehouse="postgres",table="public.customers"} 3.0
baselinr_profile_duration_seconds_sum{warehouse="postgres",table="public.customers"} 0.42
baselinr_profile_duration_seconds_count{warehouse="postgres",table="public.customers"} 3.0
# HELP baselinr_drift_events_total Total number of drift detection events
# TYPE baselinr_drift_events_total counter
baselinr_drift_events_total{warehouse="postgres",table="customers",metric="row_count",severity="high"} 2.0
# HELP baselinr_active_workers Number of currently running worker threads
# TYPE baselinr_active_workers gauge
baselinr_active_workers 0.0
Integration with Prometheus
Prometheus Configuration
Add Baselinr as a scrape target in your prometheus.yml:
scrape_configs:
  - job_name: 'baselinr'
    static_configs:
      - targets: ['localhost:9753']
    scrape_interval: 15s
Grafana Dashboard
Create custom dashboards to visualize (example queries are sketched after this list):
- Profiling throughput and latency
- Drift detection patterns over time
- Error rates by warehouse
- Active worker concurrency
- Schema change frequency
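As a starting point, the PromQL expressions below cover those panels. The snippet wraps them in a query against Prometheus's standard HTTP API; it assumes Prometheus is reachable at localhost:9090 and that the requests package is installed, and the expressions can equally be pasted directly into Grafana.

import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus instance
QUERIES = {
    "profiling_throughput": "sum(rate(baselinr_profile_runs_total[5m]))",
    "profile_latency_p95": "histogram_quantile(0.95, sum(rate(baselinr_profile_duration_seconds_bucket[5m])) by (le))",
    "drift_events_per_hour": "sum(increase(baselinr_drift_events_total[1h])) by (severity)",
    "error_rate_by_warehouse": "sum(rate(baselinr_errors_total[5m])) by (warehouse)",
    "active_workers": "baselinr_active_workers",
    "schema_changes_per_hour": "sum(increase(baselinr_schema_changes_total[1h]))",
}

for panel, expr in QUERIES.items():
    # /api/v1/query returns the current value of each expression.
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
    print(panel, resp.json()["data"]["result"])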
Architecture
The metrics system is integrated at multiple layers:
- CLI Layer: Starts metrics server, creates run context with metrics flag
- Profiling Engine: Records profile start/complete/failed events with timing
- Drift Detector: Records drift events, schema changes, and detection timing
- Warehouse Connectors: Records query execution times and errors
- Run Context: Propagates metrics_enabled flag through the pipeline
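The pattern at each layer is the same: a metric update guarded by the run context's metrics_enabled flag, so instrumentation becomes a no-op when metrics are disabled. A minimal sketch of that idea follows; RunContext and record_profile_run here are illustrative, not Baselinr's actual classes.

from dataclasses import dataclass
from prometheus_client import Counter

PROFILE_RUNS = Counter(
    "baselinr_profile_runs_total",
    "Total number of profiling runs",
    ["warehouse", "table", "status"],
)

@dataclass
class RunContext:
    warehouse: str
    metrics_enabled: bool = False

    def record_profile_run(self, table: str, status: str) -> None:
        if not self.metrics_enabled:
            return  # metrics disabled: recording is a no-op
        PROFILE_RUNS.labels(warehouse=self.warehouse, table=table, status=status).inc()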
Troubleshooting
Metrics Server Won't Start
Error: Failed to start metrics server on port 9753: Address already in use
Solution: Another process is using port 9753. Change the port in your config:
monitoring:
  enable_metrics: true
  port: 9754  # Use a different port
No Metrics Appearing
Check:
- Is enable_metrics: true set in your config?
- Is prometheus_client installed?
- Is the metrics server running? Check the logs for "Prometheus metrics server started".
- Can you access http://localhost:9753/metrics in your browser?
Import Error
Error: ModuleNotFoundError: No module named 'prometheus_client'
Solution: Install the Prometheus client:
pip install "prometheus_client>=0.19.0"
Best Practices
- Enable in Production: Metrics are lightweight and provide valuable observability
- Use Appropriate Port: Ensure port 9753 (or your custom port) is accessible
- Monitor Latency: Watch the baselinr_profile_duration_seconds histograms for performance issues
- Alert on Errors: Set up alerts on baselinr_errors_total
- Track Drift Trends: Use drift metrics to identify data quality issues early
Performance Impact
The metrics system has minimal overhead:
- Metric updates: ~0.001ms per operation
- HTTP server: Single background thread
- Memory: ~10-50MB for metric storage
- Network: ~1-5KB per scrape
Future Enhancements
Planned improvements:
- Dagster-specific metrics (asset materializations, job success rates)
- Custom metric exporters (StatsD, CloudWatch, Datadog)
- Metric aggregation by environment
- SLA tracking and alerting
- Cost metrics (query costs, storage usage)
For questions or issues, please file a GitHub issue or contact the Baselinr team.