Performance Tuning Guide

A guide to optimizing Baselinr for large-scale profiling and better overall performance.

Overview

This guide covers optimization strategies for improving Baselinr performance at scale. Key areas include:

  • Profiling speed: Reduce time to profile large tables
  • Resource usage: Minimize CPU, memory, and network usage
  • Cost optimization: Reduce database compute costs
  • Scalability: Handle large numbers of tables efficiently

Profiling Performance

Reduce Metrics Computed

Limit the number of metrics computed to improve performance.

Strategy:

  1. Remove expensive metrics for large tables:

    profiling:
      tables:
        - table: large_table
          # Only essential metrics
          metrics:
            - count
            - null_ratio
            - distinct_count
          # Skip: histogram, mean, stddev for large tables
  2. Use different metric sets per table:

    profiling:
      tables:
        - table: critical_table
          # Full metrics
          metrics:
            - count
            - null_ratio
            - distinct_count
            - mean
            - stddev
            - histogram
        - table: large_table
          # Minimal metrics
          metrics:
            - count
            - null_ratio
            - distinct_count

Performance Impact:

  • Removing histogram: 20-30% faster
  • Removing mean/stddev: 10-15% faster
  • Keeping only essential metrics: 40-50% faster

Limit Distinct Values

Reduce max_distinct_values for high-cardinality columns.

Strategy:

profiling:
  max_distinct_values: 100  # Default is 1000

When to Use:

  • Tables with many high-cardinality columns
  • When exact distinct counts aren't critical
  • When you only need approximate counts

Performance Impact:

  • Reduces query time for high-cardinality columns by 30-50%

Disable Histograms When Not Needed

Histograms are expensive to compute but provide valuable distribution information.

Strategy:

profiling:
  compute_histograms: false  # Disable for all tables

Or per-table:

profiling:
  tables:
    - table: large_table
      # Override to disable histograms
      metadata:
        compute_histograms: false

When to Disable:

  • Very large tables (>100M rows)
  • When distribution information isn't needed
  • During initial profiling to establish baselines

Performance Impact:

  • 20-40% faster profiling for tables with many numeric columns

Sampling Strategies

Random Sampling

Use random sampling for general-purpose profiling.

Configuration:

profiling:
  tables:
    - table: very_large_table
      sampling:
        enabled: true
        method: random
        fraction: 0.01      # 1% sample
        max_rows: 1000000   # Cap at 1M rows

Best Practices:

  • Use 1-5% fraction for very large tables
  • Set max_rows to prevent excessive sampling time
  • Use a higher fraction (5-10%) for tables at the smaller end of the large range (roughly 1M-10M rows)

Performance Impact:

  • 1% sample: 100x faster for row count and basic metrics
  • Trade-off: Slight accuracy loss, but statistically valid

Stratified Sampling

Use stratified sampling for skewed distributions.

Configuration:

profiling:
  tables:
    - table: skewed_table
      sampling:
        enabled: true
        method: stratified
        fraction: 0.05      # 5% sample
        max_rows: 2000000

When to Use:

  • Tables with highly skewed distributions
  • When you need to capture rare values
  • Categorical columns with long tails

Performance Impact:

  • Similar to random sampling but better coverage of rare values

Sampling Guidelines

Table Size        Recommended Fraction   Max Rows
< 1M rows         1.0 (no sampling)      N/A
1M - 10M rows     0.10 (10%)             1M
10M - 100M rows   0.05 (5%)              2M
100M - 1B rows    0.01 (1%)              5M
> 1B rows         0.005 (0.5%)           10M
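
As an example of applying these guidelines, a hypothetical ~500M-row table falls in the 100M - 1B band, so a 1% sample capped at 5M rows is appropriate. The sketch below reuses the sampling keys from the earlier examples; the table name is a placeholder:

profiling:
  tables:
    - table: orders_history    # placeholder: ~500M rows
      sampling:
        enabled: true
        method: random
        fraction: 0.01         # 1% per the guidelines above
        max_rows: 5000000      # 5M-row cap per the guidelines above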

Partition Strategies

Latest Partition Only

Profile only the latest partition for time-series data.

Configuration:

profiling:
  tables:
    - table: events
      partition:
        key: event_date
        strategy: latest

Performance Impact:

  • Dramatically faster for partitioned tables
  • Only profiles recent data (good for drift detection)

Recent N Partitions

Profile a rolling window of partitions.

Configuration:

profiling:
  tables:
    - table: events
      partition:
        key: event_date
        strategy: recent_n
        recent_n: 7   # Last 7 days

Performance Impact:

  • Faster than profiling all partitions
  • Maintains some historical context

Best Practices:

  • Use recent_n based on partition frequency (see the sketch after this list):
    • Daily partitions: recent_n: 7 (last week)
    • Hourly partitions: recent_n: 24 (last day)
    • Monthly partitions: recent_n: 3 (last quarter)
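
For example, a sketch for hourly partitions, reusing the partition keys shown above; the table and partition-key names are placeholders:

profiling:
  tables:
    - table: events_hourly     # placeholder table with hourly partitions
      partition:
        key: event_hour        # placeholder partition key
        strategy: recent_n
        recent_n: 24           # last 24 hours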

Parallelism Configuration

Enabling Parallelism

Enable parallel profiling for multiple tables.

Configuration:

execution:
  max_workers: 4   # Process 4 tables in parallel
  batch_size: 10
  queue_size: 100

Performance Impact:

  • Near-linear speedup up to the database concurrency limit
  • 4 workers: ~3.5x faster (accounting for overhead)

Warehouse-Specific Limits

Set different worker limits per database type.

Configuration:

execution:
  max_workers: 8      # Default limit
  warehouse_limits:
    snowflake: 20     # Snowflake can handle more
    postgres: 8       # Postgres moderate concurrency
    mysql: 6          # MySQL lower concurrency
    sqlite: 1         # SQLite single writer (no parallelism)
    bigquery: 10      # BigQuery query limits
    redshift: 8       # Redshift moderate concurrency

Best Practices:

  • Start conservative and scale up
  • Monitor database load and query queue
  • Respect database connection limits
  • Use read replicas if available

Performance Impact:

  • Optimal worker count depends on database capacity
  • Too many workers can cause contention and slowdown

Finding Optimal Worker Count

Test different worker counts to find the optimal setting.

Process:

  1. Start with max_workers: 2
  2. Profile a representative set of tables
  3. Measure total time
  4. Increase workers and repeat
  5. Stop when performance plateaus or degrades

Example:

# Test configuration
execution:
  max_workers: 2   # Start here
# ... profile and measure

# Then test
execution:
  max_workers: 4
# ... profile and measure

# Continue until optimal

Incremental Profiling

Enable Incremental Profiling

Skip unchanged tables to improve performance.

Configuration:

incremental:
  enabled: true
  change_detection:
    enabled: true
    metadata_table: baselinr_table_state
    snapshot_ttl_minutes: 1440   # 24 hours

Performance Impact:

  • Skip 80-90% of tables that haven't changed
  • 10-20x faster for typical production workloads

Best Practices:

  • Enable for production deployments
  • Set appropriate TTL based on update frequency
  • Monitor for false negatives (changed tables that are incorrectly skipped as unchanged)

Change Detection

Optimize change detection metadata cache.

Configuration:

incremental:
  change_detection:
    enabled: true
    snapshot_ttl_minutes: 1440   # Cache for 24 hours
    connector_overrides:
      snowflake:
        # Snowflake-specific optimizations
        cache_partition_info: true

Performance Impact:

  • Faster change detection with cached metadata
  • Reduces database queries for metadata

Partial Profiling

Enable partial profiling for partition-aware optimization.

Configuration:

incremental:
  partial_profiling:
    enabled: true
    allow_partition_pruning: true
    max_partitions_per_run: 64

Performance Impact:

  • Only profile changed partitions
  • Significantly faster for partitioned tables

Database Optimization

Connection Pooling

Configure connection pooling appropriately.

Best Practices:

  • Use appropriate pool size based on worker count (see the sketch below)
  • Set reasonable connection timeouts
  • Monitor connection pool usage

Performance Impact:

  • Reduces connection overhead
  • Better connection reuse
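
A sketch of what pool settings might look like is shown below. The pool_size, pool_timeout_seconds, and pool_recycle_seconds keys are assumptions rather than documented Baselinr options, so treat this as an illustration of the idea, not copy-paste configuration:

source:
  type: postgres
  pool_size: 8                 # assumption: roughly match execution.max_workers
  pool_timeout_seconds: 30     # assumption: fail fast when the pool is exhausted
  pool_recycle_seconds: 1800   # assumption: recycle stale connections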

Query Optimization

Optimize database queries where possible.

Strategies:

  1. Use indexed columns for partition keys
  2. Create indexes on frequently queried metadata tables
  3. Use appropriate query timeouts

Example (PostgreSQL):

-- Index metadata tables for faster queries
CREATE INDEX idx_runs_profiled_at ON baselinr_runs(profiled_at DESC);
CREATE INDEX idx_results_run_id ON baselinr_results(run_id);
CREATE INDEX idx_results_dataset ON baselinr_results(dataset_name);

Read Replicas

Use read replicas for profiling to reduce load on primary database.

Configuration:

# Profile from read replica
source:
  type: postgres
  host: read-replica.example.com   # Use replica

# Store results in primary
storage:
  connection:
    type: postgres
    host: primary.example.com      # Use primary

Performance Impact:

  • Reduces load on primary database
  • Better performance isolation

Memory Optimization

Reduce Memory Footprint

Optimize memory usage for large tables.

Strategies:

  1. Use sampling to reduce data scanned
  2. Process tables sequentially instead of parallel (if memory-constrained)
  3. Reduce metrics computed
  4. Limit distinct values

Configuration:

# Memory-optimized configuration
execution:
  max_workers: 1   # Sequential processing

profiling:
  max_distinct_values: 100    # Reduce memory usage
  compute_histograms: false   # Skip expensive metrics

Streaming Results

Use streaming for very large result sets.

Best Practices:

  • Results are streamed to storage automatically
  • Don't load all results into memory at once
  • Use query limits when querying results

Network Optimization

Reduce Network Latency

Optimize for network performance.

Strategies:

  1. Co-locate Baselinr with database (same region/VPC)
  2. Use connection pooling
  3. Minimize round trips
  4. Compress large payloads (if database supports)

Performance Impact:

  • Lower latency improves overall performance
  • Especially important for cloud databases

Batch Operations

Use batch operations where possible.

Best Practices:

  • Baselinr batches writes automatically
  • Configure batch_size appropriately (see the sketch below)
  • Monitor batch performance
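
A small sketch tuning batch settings alongside the execution keys shown earlier; the values are illustrative starting points rather than recommendations for every warehouse:

execution:
  max_workers: 4
  batch_size: 50    # larger batches mean fewer write round trips
  queue_size: 200   # keep workers fed without unbounded memory growth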

Monitoring and Metrics

Enable Prometheus Metrics

Monitor performance metrics.

Configuration:

monitoring:
  enable_metrics: true
  port: 9753
  keep_alive: true
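
On the scraping side, a minimal Prometheus scrape job pointing at the port above might look like the sketch below; the job name and target host are placeholders for your environment:

scrape_configs:
  - job_name: baselinr                      # placeholder job name
    static_configs:
      - targets: ["baselinr-host:9753"]     # replace with the host running Baselinr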

Key Metrics to Monitor:

  • Profiling duration per table
  • Database query time
  • Memory usage
  • Parallel worker utilization
  • Error rates

Performance Baselines

Establish performance baselines.

Process:

  1. Profile representative tables
  2. Measure key metrics:
    • Total profiling time
    • Time per table
    • Database query time
    • Memory usage
  3. Document baselines
  4. Monitor for regressions

Example:

Baseline Performance (100 tables):
- Total time: 45 minutes
- Average time per table: 27 seconds
- Database query time: 20 seconds/table
- Memory usage: 2GB peak

Performance Regression Detection

Set up alerts for performance regressions.

Strategies:

  • Alert if profiling time exceeds baseline by 50% (see the example rule below)
  • Alert if memory usage exceeds thresholds
  • Alert on error rate increases
  • Monitor database load during profiling
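
If the Prometheus metrics above are enabled, the first strategy can be expressed as an alerting rule. The sketch below is illustrative: the metric name baselinr_profiling_run_duration_seconds is an assumption, and the 4050-second threshold corresponds to the 45-minute baseline from the earlier example plus 50%:

groups:
  - name: baselinr_performance
    rules:
      - alert: BaselinrProfilingSlow
        # Assumed metric name; substitute the duration metric your deployment exposes
        expr: baselinr_profiling_run_duration_seconds > 4050
        labels:
          severity: warning
        annotations:
          summary: "Profiling run exceeded the 45-minute baseline by more than 50%"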