Retry & Recovery - Quick Start
What's Been Addedโ
Baselinr now automatically retries transient warehouse failures with exponential backoff, structured logging, event emission, and Prometheus metrics.
๐ Quick Startโ
1. Configuration (Already Enabled!)โ
Retry is enabled by default with sensible defaults:
retry:
enabled: true # Already on!
retries: 3 # Max retry attempts
backoff_strategy: exponential
min_backoff: 0.5 # 0.5 seconds
max_backoff: 8.0 # 8 seconds max
2. Run Profilingโ
# Retry automatically handles transient errors
baselinr profile --config examples/config.yml
3. Check Logsโ
Watch for retry attempts in logs:
{
"event": "retry_attempt",
"level": "warning",
"attempt": 2,
"error": "Connection timeout",
"backoff_seconds": 1.12
}
๐ฏ What Gets Retriedโ
โ Automatic Retry (Transient Errors)โ
- Connection timeouts
- Connection lost/reset
- Network errors
- Rate limits
- Deadlocks
- Temporary unavailability
โ No Retry (Permanent Errors)โ
- Syntax errors
- Authentication failures
- Permission denied
- Table not found
- Data type errors
๐ Monitoringโ
Structured Logsโ
# Watch retry activity
tail -f baselinr.log | grep retry_attempt
Prometheus Metricsโ
# Rate of retries
rate(baselinr_warehouse_transient_errors_total[5m])
Event Busโ
# Subscribe to retry events
@event_bus.subscribe("retry_attempt")
def handle_retry(event):
print(f"Retry: {event.metadata['function']}")
๐ง Tuning for Your Environmentโ
Production (Stable Warehouse)โ
retry:
retries: 3
min_backoff: 0.5
max_backoff: 8.0
Production (Flaky Network)โ
retry:
retries: 5
min_backoff: 1.0
max_backoff: 16.0
Development/Testingโ
retry:
retries: 1
min_backoff: 0.1
max_backoff: 1.0
Disable Retryโ
retry:
enabled: false
๐ Full Documentationโ
Comprehensive Guide: docs/guides/RETRY_AND_RECOVERY.md
Topics Covered:
- Configuration options
- Error classification details
- Observability (logs, events, metrics)
- Best practices
- Troubleshooting
- Performance impact
- Programmatic usage
๐งช Testingโ
# Run retry tests
pytest tests/utils/test_retry.py -v
# Test with your config
baselinr profile --config examples/config.yml
๐ก Key Featuresโ
โ
Exponential backoff - Delays increase: 0.5s โ 1s โ 2s โ 4s โ 8s
โ
Jitter - ยฑ15% randomization prevents thundering herd
โ
Intelligent classification - Distinguishes transient from permanent errors
โ
Graceful degradation - Failed tables don't abort the run
โ
Full observability - Logs + Events + Metrics
โ
Zero configuration - Works out of the box
๐ค Common Questionsโ
Q: Will retry slow down my profiling?
A: Only if errors occur. Successful operations have ~0.1ms overhead (negligible).
Q: Can I retry specific operations only?
A: Retry applies to all warehouse operations. Use enabled: false to disable globally.
Q: How do I know if retry is working?
A: Check logs for retry_attempt events or monitor the baselinr_warehouse_transient_errors_total metric.
Q: What if I want different retry config per table?
A: Currently retry config is global. Per-table config is a future enhancement.
๐ Troubleshootingโ
Problem: Profiling taking too long
Solution: Reduce retries or max_backoff
Problem: Still getting connection errors
Solution: Increase retries or check warehouse health
Problem: Want to see retry in action
Solution: Set log level to DEBUG and watch for retry events
๐ Related Docsโ
Ready to use! Retry is enabled by default. Your profiling is now more resilient. ๐