Retry & Recovery - Quick Start

What's Been Added

Baselinr now automatically retries transient warehouse failures with exponential backoff, structured logging, event emission, and Prometheus metrics.

🚀 Quick Start

1. Configuration (Already Enabled!)

Retry is enabled by default with these settings:

retry:
  enabled: true              # Already on!
  retries: 3                 # Max retry attempts
  backoff_strategy: exponential
  min_backoff: 0.5           # 0.5 seconds
  max_backoff: 8.0           # 8 seconds max

2. Run Profiling

# Retry automatically handles transient errors
baselinr profile --config examples/config.yml

3. Check Logs

Watch for retry attempts in logs:

{
  "event": "retry_attempt",
  "level": "warning",
  "attempt": 2,
  "error": "Connection timeout",
  "backoff_seconds": 1.12
}
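
To summarize retry activity rather than eyeball it, the log entries can be aggregated with a short script. This is a minimal sketch that assumes the log is written as one JSON object per line with the fields shown above; the helper name `count_retry_errors` and the `profiling_completed` event in the sample data are hypothetical.

```python
import json
from collections import Counter


def count_retry_errors(lines):
    """Count retry_attempt events per error message in JSON-lines log output."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(record, dict):
            continue  # ignore JSON values that are not objects
        if record.get("event") == "retry_attempt":
            counts[record.get("error", "unknown")] += 1
    return counts


# Sample data: the second entry mirrors the log example above; the
# "profiling_completed" event name is made up for illustration.
sample = [
    '{"event": "retry_attempt", "level": "warning", "attempt": 1, '
    '"error": "Connection timeout", "backoff_seconds": 0.53}',
    '{"event": "retry_attempt", "level": "warning", "attempt": 2, '
    '"error": "Connection timeout", "backoff_seconds": 1.12}',
    '{"event": "profiling_completed", "level": "info"}',
    "plain text line that is not JSON",
]

for error, count in count_retry_errors(sample).items():
    print(f"{error}: {count} retry attempt(s)")
```

To run it against a real log, pass an open file handle instead of `sample`, since iterating a file yields lines.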

🎯 What Gets Retried

✅ Automatic Retry (Transient Errors)

  • Connection timeouts
  • Connection lost/reset
  • Network errors
  • Rate limits
  • Deadlocks
  • Temporary unavailability

โŒ No Retry (Permanent Errors)โ€‹

  • Syntax errors
  • Authentication failures
  • Permission denied
  • Table not found
  • Data type errors
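
The split between the two lists can be pictured as a small classifier. The sketch below is not Baselinr's actual implementation: it is an illustration that matches on error message text, and the marker strings, the `is_transient` name, and the choice to treat unrecognized errors as permanent are all assumptions.

```python
# Hypothetical message fragments, chosen to mirror the two lists above.
TRANSIENT_MARKERS = (
    "timeout",
    "timed out",
    "connection lost",
    "connection reset",
    "network",
    "rate limit",
    "deadlock",
    "temporarily unavailable",
)
PERMANENT_MARKERS = (
    "syntax error",
    "authentication",
    "permission denied",
    "not found",
    "data type",
)


def is_transient(error):
    """Return True if the error looks transient and is worth retrying."""
    message = str(error).lower()
    # Permanent markers win: retrying a bad query or bad credentials cannot help.
    if any(marker in message for marker in PERMANENT_MARKERS):
        return False
    # Anything unrecognized is treated as permanent (an assumption of this sketch).
    return any(marker in message for marker in TRANSIENT_MARKERS)


for text in ("Connection timeout", "Deadlock detected", "Permission denied for table orders"):
    verdict = "retry" if is_transient(Exception(text)) else "fail fast"
    print(f"{text!r}: {verdict}")
```

A real classifier would more likely inspect driver-specific exception types or error codes than message text; string matching is used here only to keep the example self-contained.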

📊 Monitoring

Structured Logs

# Watch retry activity
tail -f baselinr.log | grep retry_attempt

Prometheus Metrics

# Per-second rate of transient warehouse errors over the last 5 minutes
rate(baselinr_warehouse_transient_errors_total[5m])

Event Bus

# Subscribe to retry events
@event_bus.subscribe("retry_attempt")
def handle_retry(event):
    print(f"Retry: {event.metadata['function']}")

🔧 Tuning for Your Environment

Production (Stable Warehouse)

retry:
  retries: 3
  min_backoff: 0.5
  max_backoff: 8.0

Production (Flaky Network)

retry:
  retries: 5
  min_backoff: 1.0
  max_backoff: 16.0

Development/Testing

retry:
  retries: 1
  min_backoff: 0.1
  max_backoff: 1.0

Disable Retry

retry:
  enabled: false

📖 Full Documentation

Comprehensive Guide: docs/guides/RETRY_AND_RECOVERY.md

Topics Covered:

  • Configuration options
  • Error classification details
  • Observability (logs, events, metrics)
  • Best practices
  • Troubleshooting
  • Performance impact
  • Programmatic usage

🧪 Testing

# Run retry tests
pytest tests/utils/test_retry.py -v

# Test with your config
baselinr profile --config examples/config.yml

💡 Key Features

✅ Exponential backoff - Delays double from min_backoff up to max_backoff: 0.5s → 1s → 2s → 4s → 8s
✅ Jitter - ±15% randomization prevents thundering herd
✅ Intelligent classification - Distinguishes transient from permanent errors
✅ Graceful degradation - Failed tables don't abort the run
✅ Full observability - Logs + Events + Metrics
✅ Zero configuration - Works out of the box
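
The backoff and jitter behavior listed above can be sketched in a few lines. This is an illustration under stated assumptions, not Baselinr's code: the names `backoff_delay` and `call_with_retry` are hypothetical, `retries` is taken to mean retries after the first attempt, and jitter is applied after the cap, so a jittered delay may slightly exceed `max_backoff`.

```python
import random
import time


def backoff_delay(attempt, min_backoff=0.5, max_backoff=8.0, jitter=0.15, rng=random):
    """Delay in seconds before retry number `attempt` (1-based)."""
    # Double the delay on each retry, capped at max_backoff.
    base = min(max_backoff, min_backoff * 2 ** (attempt - 1))
    # Spread the delay by +/- `jitter` so many clients do not retry in lockstep.
    return base * (1 + rng.uniform(-jitter, jitter))


def call_with_retry(func, retries=3, is_retryable=lambda exc: True,
                    sleep=time.sleep, **backoff):
    """Call func(), retrying up to `retries` times on retryable errors."""
    for attempt in range(1, retries + 2):  # one initial call plus `retries` retries
        try:
            return func()
        except Exception as exc:
            # Give up when retries are exhausted or the error is permanent.
            if attempt > retries or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, **backoff))


# Base schedule without jitter, matching the sequence listed above.
print([backoff_delay(n, jitter=0) for n in range(1, 6)])
```

Passing `sleep` as a parameter keeps the loop testable: a test can substitute a function that records the delays instead of waiting for them.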

🤔 Common Questions

Q: Will retry slow down my profiling?
A: Only if errors occur. Successful operations have ~0.1ms overhead (negligible).

Q: Can I retry specific operations only?
A: Retry applies to all warehouse operations. Use enabled: false to disable globally.

Q: How do I know if retry is working?
A: Check logs for retry_attempt events or monitor the baselinr_warehouse_transient_errors_total metric.

Q: What if I want different retry config per table?
A: Currently retry config is global. Per-table config is a future enhancement.

🆘 Troubleshooting

Problem: Profiling taking too long
Solution: Reduce retries or max_backoff
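
To judge how far to reduce these values, it helps to estimate the most backoff time a single failing operation can accumulate. The sketch below is a rough calculation that assumes delays double from min_backoff, are capped at max_backoff, and carry up to +15% jitter; the function name `worst_case_wait` and the profile labels are hypothetical, and time spent inside the failed calls themselves (for example, waiting on a connection timeout) is not included.

```python
def worst_case_wait(retries, min_backoff, max_backoff, jitter=0.15):
    """Upper bound on total backoff seconds for one operation that exhausts its retries."""
    # Sum the capped, doubling delays for each retry.
    base = sum(min(max_backoff, min_backoff * 2 ** n) for n in range(retries))
    # Assume every delay lands at the top of the jitter range.
    return base * (1 + jitter)


# Settings taken from the tuning profiles in this guide.
profiles = {
    "stable warehouse": dict(retries=3, min_backoff=0.5, max_backoff=8.0),
    "flaky network": dict(retries=5, min_backoff=1.0, max_backoff=16.0),
    "development": dict(retries=1, min_backoff=0.1, max_backoff=1.0),
}

for name, cfg in profiles.items():
    print(f"{name}: up to {worst_case_wait(**cfg):.2f}s of backoff per failing operation")
```

Multiplying the result by the number of operations likely to fail gives a rough ceiling on the extra run time that retries can add.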

Problem: Still getting connection errors
Solution: Increase retries or check warehouse health

Problem: Want to see retry in action
Solution: Set log level to DEBUG and watch for retry events


Ready to use! Retry is enabled by default. Your profiling is now more resilient. 🎉