# Partition-Aware Profiling and Sampling
Baselinr now supports advanced partition-aware profiling and flexible sampling strategies to efficiently profile large datasets.
## 🎯 Overview

Instead of the simple `sample_ratio` field, Baselinr now provides:

- **Partition-aware profiling** - Profile specific partitions only
- **Adaptive sampling** - Multiple sampling methods with configurable parameters
- **Combined strategies** - Use both partition filtering and sampling together
## 📋 Configuration Schema

**Important:** Partition and sampling configurations must be defined in ODCS contracts, not in table configs. The `profiling.tables` section only defines which tables to profile.
### Basic Table (No Partitioning or Sampling)

```yaml
profiling:
  tables:
    - table: customers
      schema: public
```

This profiles the entire table - the default behavior.
### Partition-Aware Profiling

Profile only specific partitions of your data by configuring partition settings in ODCS contracts.

In your ODCS contract (`contracts/events.odcs.yaml`):
```yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
  - name: events
    physicalName: analytics.events
    columns:
      - name: event_date
        partitionStatus: partition  # Mark partition column
customProperties:
  - property: baselinr.partition.events
    value:
      strategy: latest          # Options: latest | recent_n | sample | all
      key: event_date           # Partition column name
      metadata_fallback: true   # Auto-infer partition key if not specified
```
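The `customProperties` entry above is just a keyed mapping in the parsed contract, so extracting it is straightforward. The sketch below is a hypothetical helper (not Baselinr's actual parser) that pulls the partition settings for a dataset out of a contract loaded with `yaml.safe_load`:

```python
# Hypothetical sketch: extract Baselinr partition settings from a parsed
# ODCS contract dict. Illustrative only, not Baselinr's real parsing code.

def get_partition_config(contract, dataset_name):
    """Return the value of the baselinr.partition.<dataset> property, if any."""
    target = f"baselinr.partition.{dataset_name}"
    for prop in contract.get("customProperties", []):
        if prop.get("property") == target:
            return prop.get("value")
    return None

contract = {
    "customProperties": [
        {
            "property": "baselinr.partition.events",
            "value": {
                "strategy": "latest",
                "key": "event_date",
                "metadata_fallback": True,
            },
        }
    ]
}

cfg = get_partition_config(contract, "events")
print(cfg["strategy"])  # latest
```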
**Partition Strategies:**

- `latest` - Profile only the most recent partition

  ```yaml
  customProperties:
    - property: baselinr.partition.events
      value:
        strategy: latest
        key: event_date
  ```

- `recent_n` - Profile the N most recent partitions

  ```yaml
  customProperties:
    - property: baselinr.partition.events
      value:
        strategy: recent_n
        key: event_date
        recent_n: 7  # Last 7 days
  ```

- `sample` - Sample from partition values (planned, not yet implemented)
- `all` - Profile all partitions (explicit default)
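Conceptually, each strategy resolves to a filter on the partition column. The following is an illustrative sketch of that mapping (the real logic lives in `baselinr/profiling/query_builder.py`; this is not its actual API):

```python
# Illustrative sketch: map a partition strategy to a SQL predicate.
# Identifiers are interpolated for readability only; real code should
# quote table and column names properly.

def partition_predicate(strategy, key, table, recent_n=None):
    if strategy == "latest":
        return f"{key} = (SELECT MAX({key}) FROM {table})"
    if strategy == "recent_n":
        return (f"{key} IN (SELECT DISTINCT {key} FROM {table} "
                f"ORDER BY {key} DESC LIMIT {recent_n})")
    if strategy == "all":
        return None  # no filter: profile every partition
    raise ValueError(f"unsupported strategy: {strategy}")

print(partition_predicate("latest", "event_date", "events"))
# event_date = (SELECT MAX(event_date) FROM events)
```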
### Sampling

Apply sampling to reduce data volume by configuring sampling in ODCS contracts.

In your ODCS contract (`contracts/large_table.odcs.yaml`):
```yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
  - name: large_table
    physicalName: public.large_table
customProperties:
  - property: baselinr.sampling.large_table
    value:
      enabled: true
      method: random      # Options: random | stratified | topk
      fraction: 0.05      # 5% sample
      max_rows: 1000000   # Cap at 1M rows
```
**Sampling Methods:**

- `random` - Simple random sampling
- `stratified` - Stratified sampling (preserves distribution)
- `topk` - Top-K sampling (planned)

**Sampling Parameters:**

- `enabled` - Turn sampling on/off (default: `false`)
- `method` - Sampling algorithm
- `fraction` - Fraction of rows to sample (0.0-1.0)
- `max_rows` - Maximum rows to sample (optional cap)
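`fraction` and `max_rows` interact: the cap wins whenever the fractional sample would exceed it. A minimal sketch of that arithmetic (assumed semantics, matching the examples in this document):

```python
# Assumed semantics: sample fraction * total rows, but never more than max_rows.

def effective_sample_rows(total_rows, fraction, max_rows=None):
    """Rows actually sampled: a fraction of the table, capped at max_rows."""
    rows = int(total_rows * fraction)
    if max_rows is not None:
        rows = min(rows, max_rows)
    return rows

# 5% of 100M rows would be 5M, but the cap limits it to 1M:
print(effective_sample_rows(100_000_000, 0.05, max_rows=1_000_000))  # 1000000
```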
### Combined Partition + Sampling

For maximum efficiency on large tables, combine both in your ODCS contract.

In your ODCS contract (`contracts/clickstream.odcs.yaml`):
```yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
  - name: clickstream
    physicalName: analytics.clickstream
    columns:
      - name: date
        partitionStatus: partition
customProperties:
  - property: baselinr.partition.clickstream
    value:
      strategy: latest
      key: date
  - property: baselinr.sampling.clickstream
    value:
      enabled: true
      method: stratified
      fraction: 0.01  # 1% sample
      max_rows: 500000
```
This example:
- Filters to the latest partition (e.g., today's data)
- Then samples 1% of that partition
- Caps at 500,000 rows maximum
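Under those assumed semantics, the expected profiled row count is easy to estimate. Assuming, for illustration, a latest partition of 40M clickstream rows:

```python
# Assumed partition size for illustration; the 1% fraction and 500K cap
# come from the contract above.
partition_rows = 40_000_000
fraction, max_rows = 0.01, 500_000

profiled = min(int(partition_rows * fraction), max_rows)
print(profiled)  # 400000 -> 1% of 40M is 400K, which is under the 500K cap
```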
## 🔧 Features

### Automatic Partition Key Inference

If you enable `metadata_fallback`, Baselinr will try to infer the partition key:
```yaml
customProperties:
  - property: baselinr.partition.events
    value:
      strategy: latest
      metadata_fallback: true  # Will look for columns named: date, event_date, etc.
```
Common patterns it looks for:

- `date`, `event_date`, `partition_date`
- `created_at`, `updated_at`
- `timestamp`, `dt`, `ds`
- Any DATE or TIMESTAMP columns
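A hedged sketch of such a heuristic (not Baselinr's actual inference code): prefer well-known column names, then fall back to any date- or timestamp-typed column:

```python
# Illustrative partition-key inference. Name match first, type fallback second.
KNOWN_NAMES = ["date", "event_date", "partition_date",
               "created_at", "updated_at", "timestamp", "dt", "ds"]

def infer_partition_key(columns):
    """columns: list of (name, sql_type) tuples; returns a column name or None."""
    by_name = {name.lower(): name for name, _ in columns}
    for candidate in KNOWN_NAMES:
        if candidate in by_name:
            return by_name[candidate]
    for name, sql_type in columns:
        if sql_type.upper() in ("DATE", "TIMESTAMP"):
            return name
    return None

print(infer_partition_key([("user_id", "BIGINT"), ("ds", "VARCHAR")]))  # ds
```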
### Warehouse-Specific SQL

Baselinr generates warehouse-appropriate SQL:
PostgreSQL:

```sql
SELECT * FROM events TABLESAMPLE SYSTEM (1.0)
WHERE event_date = (SELECT MAX(event_date) FROM events);
```

Snowflake:

```sql
SELECT * FROM events SAMPLE (1.0)
WHERE event_date = (SELECT MAX(event_date) FROM events);
```

SQLite:

```sql
SELECT * FROM events
WHERE date = (SELECT MAX(date) FROM events)
ORDER BY RANDOM() LIMIT 500000;
```
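The dialect differences above mostly come down to how the sampling clause is attached to the table reference. A simplified sketch of that dispatch (the real logic is in `baselinr/profiling/query_builder.py`; this is not its actual API):

```python
# Simplified sketch of dialect-specific sampling SQL. Percent values are
# 0-100. SQLite has no TABLESAMPLE, so it falls back to ORDER BY RANDOM().

def sampled_query(dialect, table, percent, limit=None):
    if dialect == "postgresql":
        return f"SELECT * FROM {table} TABLESAMPLE SYSTEM ({percent})"
    if dialect == "snowflake":
        return f"SELECT * FROM {table} SAMPLE ({percent})"
    if dialect == "sqlite":
        query = f"SELECT * FROM {table} ORDER BY RANDOM()"
        return query + (f" LIMIT {limit}" if limit else "")
    raise ValueError(f"unsupported dialect: {dialect}")

print(sampled_query("sqlite", "events", 1.0, limit=500_000))
# SELECT * FROM events ORDER BY RANDOM() LIMIT 500000
```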
## 📊 Examples

### Example 1: Daily Partitioned Table

Profile only today's data using an ODCS contract.

ODCS Contract (`contracts/daily_events.odcs.yaml`):
```yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
  - name: daily_events
    physicalName: analytics.daily_events
    columns:
      - name: ds
        partitionStatus: partition
customProperties:
  - property: baselinr.partition.daily_events
    value:
      key: ds  # Common partition column name
      strategy: latest
```
### Example 2: Large Historical Table

Sample for efficiency using an ODCS contract.

ODCS Contract (`contracts/historical_transactions.odcs.yaml`):
```yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
  - name: historical_transactions
    physicalName: finance.historical_transactions
customProperties:
  - property: baselinr.sampling.historical_transactions
    value:
      enabled: true
      method: random
      fraction: 0.001  # 0.1% sample
      max_rows: 100000
```
### Example 3: Recent Week's Data with Sampling

ODCS Contract (`contracts/user_events.odcs.yaml`):
```yaml
kind: DataContract
apiVersion: v3.1.0
dataset:
  - name: user_events
    physicalName: analytics.user_events
    columns:
      - name: event_date
        partitionStatus: partition
customProperties:
  - property: baselinr.partition.user_events
    value:
      key: event_date
      strategy: recent_n
      recent_n: 7
  - property: baselinr.sampling.user_events
    value:
      enabled: true
      method: stratified
      fraction: 0.1
```
### Example 4: Full Table (Default)

No special configuration needed:

```yaml
tables:
  - table: small_lookup_table
    schema: public
```
## 🚀 CLI Usage

### Preview Plan

See how partitioning and sampling will be applied:

```bash
baselinr plan --config config.yml
```

Output will show:

```text
1. analytics.events
   Status: ready
   Partition: latest on event_date
   Sampling: stratified (1.00%), max 500,000 rows
```
### Run Profiling

```bash
baselinr profile --config config.yml
```
Results metadata will include:
- Which partition strategy was used
- Which partitions were profiled
- Sampling method and fraction applied
- Actual row count profiled
## 📈 Performance Benefits

### Before (`sample_ratio`)

Profile 10% of a 1B-row table:

- Scans: 100M rows
- Time: ~10 minutes

### After (partition + sampling)

Profile the latest partition (10M rows) with a 1% sample:

- Scans: 100K rows (a 1000x reduction!)
- Time: ~6 seconds
### Example Speedup

For a table with daily partitions (365 partitions, 10M rows each):

Full table with sampling:

```yaml
sampling:
  enabled: true
  fraction: 0.001  # 0.1%
# Profiles: 3.65M rows (0.1% of 3.65B)
```

Latest partition only:

```yaml
partition:
  key: date
  strategy: latest
# Profiles: 10M rows (just today)
```

Latest partition with sampling:

```yaml
partition:
  key: date
  strategy: latest
sampling:
  enabled: true
  fraction: 0.01  # 1%
# Profiles: 100K rows (1% of 10M)
```
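The three scenarios reduce to simple arithmetic over 365 partitions of 10M rows each:

```python
# Row counts for the three speedup scenarios above.
partitions, rows_per_partition = 365, 10_000_000
total = partitions * rows_per_partition            # 3.65B rows

full_table_sampled = int(total * 0.001)            # 0.1% of everything
latest_only = rows_per_partition                   # one partition, unsampled
latest_sampled = int(rows_per_partition * 0.01)    # 1% of one partition

print(full_table_sampled, latest_only, latest_sampled)
# 3650000 10000000 100000
```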
## 🔄 Migration Guide

### Old Configuration

```yaml
profiling:
  tables:
    - table: events
      schema: public
      default_sample_ratio: 1.0
```
### New Configuration

Full table (no sampling):

```yaml
profiling:
  tables:
    - table: events
      schema: public
```

With sampling:

```yaml
profiling:
  tables:
    - table: events
      schema: public
      sampling:
        enabled: true
        method: random
        fraction: 0.1
```

With partition awareness:

```yaml
profiling:
  tables:
    - table: events
      schema: public
      partition:
        key: event_date
        strategy: latest
      sampling:
        enabled: true
        fraction: 0.1
```
## 🎓 Best Practices

### 1. Start with Partition Filtering

For time-series data, profile only recent partitions:

```yaml
partition:
  key: date
  strategy: recent_n
  recent_n: 7  # Last week
```
### 2. Add Sampling for Very Large Partitions

If even one partition is huge:

```yaml
partition:
  key: date
  strategy: latest
sampling:
  enabled: true
  fraction: 0.01
  max_rows: 1000000  # Safety cap
```
### 3. Use Stratified Sampling for Diverse Data

When data has important subgroups:

```yaml
sampling:
  enabled: true
  method: stratified  # Preserves distribution
  fraction: 0.05
```
### 4. Profile Small Tables Fully

No need for complexity on small tables:

```yaml
tables:
  - table: lookup_table  # Just specify the table
    schema: public
```
## 🔮 Future Enhancements

Planned features:

- **Cost-aware profiling** - Estimate query cost before profiling
- **Incremental profiling** - Profile only new partitions
- **Adaptive sampling** - Auto-adjust based on data characteristics
- **Multi-column partitioning** - Support compound partition keys
- **Custom partition strategies** - Plug in your own logic
## 📝 See Also

- README.md - Main documentation
- `examples/config_advanced.yml` - All features demonstrated
- DEVELOPMENT.md - Architecture details
- `baselinr/profiling/query_builder.py` - Implementation