# Parallelism and Batching Guide
This guide explains how to use Baselinr's optional parallelism and batching features to speed up profiling operations.
## Table of Contents

- Overview
- When to Use Parallelism
- Configuration
- How It Works
- Warehouse-Specific Considerations
- Dagster Integration
- Performance Impact
- Troubleshooting
- Best Practices
- Examples
- Support
## Overview

Baselinr supports optional parallel execution of profiling tasks. This feature is:

- **Opt-in**: Defaults to sequential execution (`max_workers=1`)
- **CLI-focused**: The primary benefit is for CLI users profiling many tables
- **Dagster-compatible**: Works with Dagster but is not required (Dagster already provides asset-level parallelism)
- **Backward compatible**: Existing configurations work unchanged
**Key Design Decisions:**

- **Default**: `max_workers=1` (sequential execution, maintains current behavior)
- **Opt-in**: Users enable parallelism via configuration
- **Primary use case**: CLI execution where users want to profile many tables faster
## When to Use Parallelism

### ✅ Use Parallelism When:
- CLI execution with many tables (10+ tables)
- Profiling large warehouses (hundreds of tables)
- Tables are independent (no cross-table dependencies)
- Warehouse can handle concurrent connections
- You want to reduce total profiling time

Example: Profiling 50 tables sequentially takes ~50 minutes. With 8 workers it takes ~7 minutes, because the tables complete in ⌈50/8⌉ = 7 waves of roughly one minute each.
### ❌ Don't Use Parallelism When:
- Dagster users with one table per asset (Dagster already parallelizes at the asset level)
- Using SQLite (single writer limitation)
- Warehouse has connection limits
- Profiling only 1-2 tables (no benefit)
- Debugging profiling issues (sequential is easier to debug)
## Configuration

### Basic Configuration

Add the `execution` section to your `config.yml`:
```yaml
# Execution and parallelism configuration (OPTIONAL)
# Default: max_workers=1 (sequential execution)
# Enable parallelism by setting max_workers > 1
execution:
  max_workers: 8   # Number of concurrent workers
  batch_size: 10   # Tables per batch (default: 10)
  queue_size: 100  # Maximum queue size (default: 100)
```
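If you drive Baselinr from Python instead of the CLI, the same settings apply. A minimal sketch using the `load_config` and `ProfileEngine` API shown in the Dagster example later in this guide (that the engine picks up the `execution` section this way is an assumption based on the documented behavior):

```python
from baselinr.config.loader import load_config
from baselinr.profiling.core import ProfileEngine

config = load_config("config.yml")  # includes the execution section above
engine = ProfileEngine(config)
engine.profile()  # profiles up to max_workers tables concurrently
```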
### Warehouse-Specific Limits
You can set different worker limits for different warehouse types:
```yaml
execution:
  max_workers: 16  # Default for all warehouses

  # Warehouse-specific overrides
  warehouse_limits:
    snowflake: 20  # Snowflake can handle high concurrency
    postgres: 8    # Postgres moderate concurrency
    mysql: 8       # MySQL moderate concurrency
    redshift: 10   # Redshift moderate-high concurrency
    bigquery: 15   # BigQuery high concurrency
    sqlite: 1      # SQLite single writer only
```
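Conceptually, a per-warehouse limit caps the global `max_workers` for that warehouse, and SQLite is always forced down to one worker. A minimal sketch of that resolution logic (an illustration of the documented behavior, not Baselinr's actual code; the function name is hypothetical):

```python
def effective_workers(max_workers: int, warehouse_type: str,
                      warehouse_limits: dict) -> int:
    """Cap the global worker count by the per-warehouse limit."""
    if warehouse_type == "sqlite":
        return 1  # single-writer limitation: always sequential
    limit = warehouse_limits.get(warehouse_type, max_workers)
    return min(max_workers, limit)

# With the config above: min(16, 8) == 8 workers against Postgres
assert effective_workers(16, "postgres", {"postgres": 8}) == 8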
### Configuration Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| `max_workers` | 1 | 1-64 | Number of concurrent worker threads |
| `batch_size` | 10 | 1-100 | Tables per batch (for future use) |
| `queue_size` | 100 | 10-1000 | Maximum task queue size |
| `warehouse_limits` | - | - | Per-warehouse worker limits |
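The documented defaults and ranges can be expressed as a small validation model. A hedged sketch (field names mirror the YAML keys; whether and how Baselinr enforces these ranges internally is an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionConfig:
    max_workers: int = 1
    batch_size: int = 10
    queue_size: int = 100
    warehouse_limits: dict = field(default_factory=dict)

    def __post_init__(self):
        # Enforce the documented ranges from the table above
        if not 1 <= self.max_workers <= 64:
            raise ValueError("max_workers must be in 1-64")
        if not 1 <= self.batch_size <= 100:
            raise ValueError("batch_size must be in 1-100")
        if not 10 <= self.queue_size <= 1000:
            raise ValueError("queue_size must be in 10-1000")
```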
## How It Works

### Sequential Execution (Default)

When `max_workers=1` (the default), Baselinr profiles tables one at a time:

```
Table 1 → Table 2 → Table 3 → ... → Table N
```
Characteristics:
- Simple, predictable
- Easy to debug
- No connection pool pressure
- Existing behavior (backward compatible)
### Parallel Execution

When `max_workers > 1`, Baselinr uses a thread pool to profile multiple tables concurrently:

```
Worker 1: Table 1 → Table 5 → Table 9
Worker 2: Table 2 → Table 6 → Table 10
Worker 3: Table 3 → Table 7 → Table 11
Worker 4: Table 4 → Table 8 → Table 12
```
Characteristics:
- Faster total time
- Higher database load
- Requires larger connection pool
- Error isolation (one table failure doesn't stop others)
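The behavior described above maps naturally onto Python's `concurrent.futures`. A minimal sketch of thread-pool profiling with per-table error isolation (`profile_table` is a hypothetical stand-in for Baselinr's internal per-table routine):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def profile_all(tables, profile_table, max_workers=8):
    """Profile tables concurrently; one failure doesn't stop the rest."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(profile_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                future.result()
                succeeded.append(table)
            except Exception as exc:
                # Error isolation: record the failure and keep going
                failed.append((table, exc))
    return succeeded, failed
```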
### Connection Pool Sizing

Baselinr automatically adjusts the database connection pool based on `max_workers`:

- Sequential (`max_workers=1`): `pool_size=5`, `max_overflow=10`
- Parallel (`max_workers=8`): `pool_size=10` (8 + 2), `max_overflow=8`
- Capped at 20: `pool_size` never exceeds 20
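Those rules are easy to state as a function. A sketch that reproduces the sizing described above (an illustration of the documented rules, not Baselinr's actual code):

```python
def connection_pool_settings(max_workers: int) -> tuple:
    """Return (pool_size, max_overflow) per the rules above."""
    if max_workers <= 1:
        return 5, 10                      # sequential defaults
    pool_size = min(max_workers + 2, 20)  # workers + headroom, capped at 20
    return pool_size, max_workers

assert connection_pool_settings(1) == (5, 10)
assert connection_pool_settings(8) == (10, 8)
```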
## Warehouse-Specific Considerations

### Snowflake

- Recommended `max_workers`: 10-20
- Notes: Handles high concurrency well, scales horizontally
- Connection limit: Check your warehouse size and tier

```yaml
execution:
  max_workers: 20
  warehouse_limits:
    snowflake: 20
```
### Postgres

- Recommended `max_workers`: 4-8
- Notes: Limited by the `max_connections` setting (default 100)
- Connection limit: `SHOW max_connections;`

```yaml
execution:
  max_workers: 8
  warehouse_limits:
    postgres: 8
```
### MySQL

- Recommended `max_workers`: 4-8
- Notes: Similar to Postgres; check `max_connections`
- Connection limit: `SHOW VARIABLES LIKE 'max_connections';`

```yaml
execution:
  max_workers: 8
  warehouse_limits:
    mysql: 8
```
### BigQuery

- Recommended `max_workers`: 10-15
- Notes: Cloud-based, scales automatically
- Rate limits: Be aware of project-level query quotas

```yaml
execution:
  max_workers: 15
  warehouse_limits:
    bigquery: 15
```
### Redshift

- Recommended `max_workers`: 8-12
- Notes: WLM (Workload Management) affects concurrency
- Connection limit: Check the `max_connections` parameter

```yaml
execution:
  max_workers: 10
  warehouse_limits:
    redshift: 10
```
### SQLite

- Recommended `max_workers`: 1 (forced)
- Notes: Single-writer limitation
- Behavior: Baselinr automatically forces sequential execution for SQLite

```yaml
execution:
  max_workers: 1  # SQLite ignores values > 1
```
## Dagster Integration

### Important for Dagster Users
Dagster already provides asset-level parallelism. Each table is a separate asset, and Dagster executes independent assets in parallel by default.
### Default Behavior (Recommended for Dagster)

Keep `max_workers=1` (the default); Dagster handles parallelism:

```yaml
# Recommended for Dagster users
execution:
  max_workers: 1  # Sequential within each asset
```
Dagster will still parallelize across assets:

```
Dagster Asset 1 (customers) ─┐
Dagster Asset 2 (products)  ─┼── Parallel at asset level
Dagster Asset 3 (orders)    ─┘
```
### When Parallelism Helps in Dagster

Parallelism can be useful if:

1. **You batch multiple tables in a single asset**

   ```python
   from dagster import asset
   from baselinr.config.loader import load_config
   from baselinr.profiling.core import ProfileEngine

   @asset
   def profile_all_dimension_tables():
       # Profile 10 dimension tables in parallel within this asset
       config = load_config("config_parallel.yml")  # max_workers=5
       engine = ProfileEngine(config)
       engine.profile()  # Profiles 5 tables at a time
   ```

2. **You want fine-grained control over warehouse load**

   ```yaml
   execution:
     max_workers: 4
     warehouse_limits:
       postgres: 4  # Limit load independently of Dagster's executor
   ```

3. **You want more control than Dagster's executor provides**
## Performance Impact

### Expected Speedup

For CLI users with many tables:
| Tables | Sequential | Parallel (8 workers) | Speedup |
|---|---|---|---|
| 10 | ~10 min | ~2 min | ~5x |
| 50 | ~50 min | ~7 min | ~7x |
| 100 | ~100 min | ~15 min | ~7x |
| 500 | ~8.3 hrs | ~1.2 hrs | ~7x |
*Assumes 1 minute per table on average.*
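These numbers follow from simple wave arithmetic: with W workers, N tables complete in roughly ⌈N/W⌉ waves. A quick sanity-check sketch (scheduling overhead is ignored, which is why the larger runs in the table come out somewhat slower than this ideal estimate):

```python
import math

def estimated_minutes(tables: int, workers: int,
                      minutes_per_table: float = 1.0) -> float:
    """Ideal wall-clock time: tables complete in ceil(tables/workers) waves."""
    return math.ceil(tables / workers) * minutes_per_table

print(estimated_minutes(50, 8))   # 7.0 — matches the ~7 min in the table
print(estimated_minutes(100, 8))  # 13.0 — the table shows ~15 min with overhead
```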
### Factors Affecting Performance
- I/O bound: Profiling is mostly I/O (database queries), so parallelism helps significantly
- Warehouse capacity: More concurrent queries = more warehouse load
- Network latency: High-latency connections benefit more from parallelism
- Table size: A few very large tables can still bottleneck a run
- Query complexity: Simple metrics profile faster than complex ones
## Troubleshooting

### Issue: No performance improvement
Symptoms: Parallel execution is no faster than sequential
Possible causes:
- Warehouse is the bottleneck (saturated)
- Too few tables to benefit from parallelism
- Connection pool is too small
- Network bandwidth is saturated
Solutions:
- Check warehouse CPU/memory usage
- Increase warehouse size (Snowflake) or connections (Postgres)
- Verify `max_workers` is > 1
- Check connection pool settings
### Issue: Connection errors

Symptoms: `Too many connections` or `Connection refused` errors

Possible causes:
- `max_workers` exceeds the warehouse connection limit
- Other applications are using connections
- Connection pool not properly sized
Solutions:

```yaml
execution:
  max_workers: 4  # Reduce workers
  warehouse_limits:
    postgres: 4  # Set warehouse-specific limit
```

Check the warehouse connection limit:

```sql
-- Postgres
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;

-- MySQL
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
```
### Issue: SQLite errors with parallelism

Symptoms: `database is locked` errors

Solution: SQLite automatically forces sequential execution; this is expected behavior.

```yaml
execution:
  max_workers: 8  # Ignored for SQLite
  warehouse_limits:
    sqlite: 1  # Forced to 1
```
### Issue: One table fails, others succeed

Symptoms: Some tables are profiled; one fails

This is expected behavior: error isolation ensures that one table's failure doesn't abort the others.

Check the logs for the specific failure:

```
ERROR: Failed to profile table orders: Connection timeout
INFO: Parallel profiling completed: 49 succeeded, 1 failed
```
### Issue: Metrics show no active workers

Symptoms: The `baselinr_active_workers` metric is 0

Possible causes:
- Metrics not enabled
- Using sequential execution (`max_workers=1`)
- No profiling currently running

Check:

```yaml
monitoring:
  enable_metrics: true
execution:
  max_workers: 8  # Must be > 1
```
## Best Practices

- Start small: Try `max_workers=2` first, then scale up
- Monitor warehouse: Watch CPU, memory, and connections during parallel runs
- Use warehouse limits: Set conservative per-warehouse limits
- CLI-focused: Use parallelism primarily for CLI execution
- Dagster users: Keep the default (`max_workers=1`) unless batching tables within an asset
- Test thoroughly: Verify results match sequential execution
- Watch logs: Check for connection errors or timeouts
- Consider cost: More parallelism means more warehouse compute cost
## Examples

### Example 1: CLI with many tables

```yaml
# Profile 100 tables in parallel
execution:
  max_workers: 8
  warehouse_limits:
    postgres: 8

profiling:
  tables:
    - table: table_001
    - table: table_002
    # ... 98 more tables
```

Run:

```bash
baselinr profile --config config.yml
```

Expected: ~7x faster than sequential
### Example 2: Dagster with batching

```python
# examples/dagster_repository.py
from dagster import asset
from baselinr.profiling.core import ProfileEngine
from baselinr.config.loader import load_config

@asset
def profile_dimension_tables():
    """Profile all dimension tables in parallel within this asset."""
    config = load_config("config_parallel.yml")
    engine = ProfileEngine(config)
    results = engine.profile()
    return results
```

```yaml
# config_parallel.yml
execution:
  max_workers: 5  # Profile 5 dimension tables concurrently

profiling:
  tables:
    - table: dim_customer
    - table: dim_product
    - table: dim_date
    - table: dim_location
    - table: dim_category
```
### Example 3: Warehouse-specific tuning

```yaml
# config_multiwarehouse.yml
execution:
  max_workers: 16  # Default

  # Fine-tuned per warehouse
  warehouse_limits:
    snowflake: 20  # Snowflake can handle more
    postgres: 8    # Postgres more conservative
    redshift: 12   # Redshift moderate
    sqlite: 1      # SQLite single writer
```
## Support

If you encounter issues or have questions:

- Check the Troubleshooting section above
- Review logs for specific errors
- Try sequential execution (`max_workers=1`) to isolate the issue
- Reduce `max_workers` and scale back up gradually