# Partition-Aware Profiling and Sampling
Baselinr now supports advanced partition-aware profiling and flexible sampling strategies to efficiently profile large datasets.
## 🎯 Overview

Instead of the simple `sample_ratio` field, Baselinr now provides:

- **Partition-aware profiling** - Profile specific partitions only
- **Adaptive sampling** - Multiple sampling methods with configurable parameters
- **Combined strategies** - Use both partition filtering and sampling together
## 📋 Configuration Schema

### Basic Table (No Partitioning or Sampling)

```yaml
profiling:
  tables:
    - table: customers
      schema: public
```

This profiles the entire table - the default behavior.
### Partition-Aware Profiling

Profile only specific partitions of your data:

```yaml
profiling:
  tables:
    - table: events
      schema: analytics
      partition:
        key: event_date          # Partition column name
        strategy: latest         # Options: latest | recent_n | sample | all
        metadata_fallback: true  # Auto-infer partition key if not specified
```
**Partition Strategies:**

- `latest` - Profile only the most recent partition

  ```yaml
  partition:
    key: event_date
    strategy: latest
  ```

- `recent_n` - Profile the N most recent partitions

  ```yaml
  partition:
    key: event_date
    strategy: recent_n
    recent_n: 7  # Last 7 days
  ```

- `sample` - Sample from partition values (planned, not yet implemented)
- `all` - Profile all partitions (explicit default)
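To make the strategies above concrete, here is a minimal Python sketch of how they could translate into SQL filters. The function name and the exact SQL shapes are assumptions for illustration, not Baselinr's actual query builder.

```python
# Illustrative sketch only - not Baselinr's actual query builder logic.
def partition_predicate(strategy: str, key: str, table: str, recent_n: int = 1) -> str:
    """Build a WHERE-clause fragment for the chosen partition strategy."""
    if strategy == "latest":
        # Keep only the most recent partition value.
        return f"{key} = (SELECT MAX({key}) FROM {table})"
    if strategy == "recent_n":
        # Keep the N most recent partition values.
        return (
            f"{key} IN (SELECT DISTINCT {key} FROM {table} "
            f"ORDER BY {key} DESC LIMIT {recent_n})"
        )
    return "1 = 1"  # 'all' (and 'sample', once implemented) applies no filter here

# e.g. partition_predicate("recent_n", "event_date", "events", 7) restricts the
# scan to the seven most recent event_date values.
```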
### Sampling

Apply sampling to reduce data volume:

```yaml
profiling:
  tables:
    - table: large_table
      schema: public
      sampling:
        enabled: true
        method: random    # Options: random | stratified | topk
        fraction: 0.05    # 5% sample
        max_rows: 1000000 # Cap at 1M rows
```
**Sampling Methods:**

- `random` - Simple random sampling
- `stratified` - Stratified sampling (preserves distribution)
- `topk` - Top-K sampling (planned)

**Sampling Parameters:**

- `enabled` - Turn sampling on/off (default: false)
- `method` - Sampling algorithm
- `fraction` - Fraction of rows to sample (0.0-1.0)
- `max_rows` - Maximum rows to sample (optional cap)
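The interaction between `fraction` and `max_rows` is easiest to see as arithmetic. The helper below is a hypothetical sketch, assuming the cap is applied after the fraction; it is not Baselinr's exact logic.

```python
# Hypothetical sketch: sample `fraction` of the rows, then apply the max_rows cap.
def effective_sample_size(table_rows: int, fraction: float, max_rows: int | None = None) -> int:
    sampled = int(table_rows * fraction)
    return min(sampled, max_rows) if max_rows is not None else sampled

# 5% of a 100M-row table would be 5M rows, capped at 1M by max_rows.
print(effective_sample_size(100_000_000, 0.05, 1_000_000))  # -> 1000000
```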
### Combined Partition + Sampling

For maximum efficiency on large tables:

```yaml
profiling:
  tables:
    - table: clickstream
      schema: analytics
      partition:
        key: date
        strategy: latest
      sampling:
        enabled: true
        method: stratified
        fraction: 0.01   # 1% sample
        max_rows: 500000
```
This example:
- Filters to the latest partition (e.g., today's data)
- Then samples 1% of that partition
- Caps at 500,000 rows maximum
## 🔧 Features

### Automatic Partition Key Inference

If you enable `metadata_fallback`, Baselinr will try to infer the partition key:

```yaml
partition:
  strategy: latest
  metadata_fallback: true  # Will look for columns named: date, event_date, etc.
```

Common patterns it looks for:

- `date`, `event_date`, `partition_date`
- `created_at`, `updated_at`
- `timestamp`, `dt`, `ds`
- Any DATE or TIMESTAMP columns
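A minimal sketch of what such a fallback heuristic might look like, assuming column metadata is available as (name, type) pairs. The function name, priority order, and type check are illustrative assumptions, not the actual implementation.

```python
# Illustrative heuristic only - not Baselinr's actual inference code.
PREFERRED_PARTITION_NAMES = [
    "date", "event_date", "partition_date",
    "created_at", "updated_at", "timestamp", "dt", "ds",
]

def infer_partition_key(columns: list[tuple[str, str]]) -> str | None:
    """Guess a partition column from (name, type) pairs, or return None."""
    by_name = {name.lower(): name for name, _ in columns}
    # 1. Prefer well-known partition column names, in priority order.
    for candidate in PREFERRED_PARTITION_NAMES:
        if candidate in by_name:
            return by_name[candidate]
    # 2. Otherwise fall back to the first DATE/TIMESTAMP-typed column.
    for name, dtype in columns:
        if dtype.upper().startswith(("DATE", "TIMESTAMP")):
            return name
    return None

print(infer_partition_key([("user_id", "INTEGER"), ("ds", "DATE")]))  # -> ds
```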
### Warehouse-Specific SQL

Baselinr generates warehouse-appropriate SQL:

**PostgreSQL:**

```sql
SELECT * FROM events TABLESAMPLE SYSTEM (1.0)
WHERE event_date = (SELECT MAX(event_date) FROM events);
```

**Snowflake:**

```sql
SELECT * FROM events SAMPLE (1.0)
WHERE event_date = (SELECT MAX(event_date) FROM events);
```

**SQLite:**

```sql
SELECT * FROM events
WHERE date = (SELECT MAX(date) FROM events)
ORDER BY RANDOM() LIMIT 500000;
```
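The dialect differences come down to how the sampling clause is rendered and where it is attached. The sketch below is an assumption about how such dispatch could look; it is not the code in `baselinr/profiling/query_builder.py`.

```python
# Hypothetical dialect dispatch - illustrative only.
def sampling_clause(dialect: str, fraction: float, max_rows: int | None = None) -> str:
    """Render a sampling clause for the given warehouse dialect.

    For PostgreSQL/Snowflake the clause attaches to the table reference;
    for SQLite it is appended to the end of the query.
    """
    percent = fraction * 100  # e.g. 0.01 -> 1.0 (percent)
    if dialect == "postgresql":
        return f"TABLESAMPLE SYSTEM ({percent})"
    if dialect == "snowflake":
        return f"SAMPLE ({percent})"
    # SQLite has no TABLESAMPLE, so fall back to random ordering with a cap.
    cap = max_rows if max_rows is not None else 500_000
    return f"ORDER BY RANDOM() LIMIT {cap}"

print(sampling_clause("postgresql", 0.01))  # -> TABLESAMPLE SYSTEM (1.0)
```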
## 📊 Examples

### Example 1: Daily Partitioned Table

Profile only today's data:

```yaml
tables:
  - table: daily_events
    schema: analytics
    partition:
      key: ds  # Common partition column name
      strategy: latest
```
### Example 2: Large Historical Table

Sample for efficiency:

```yaml
tables:
  - table: historical_transactions
    schema: finance
    sampling:
      enabled: true
      method: random
      fraction: 0.001  # 0.1% sample
      max_rows: 100000
```
### Example 3: Recent Week's Data with Sampling

```yaml
tables:
  - table: user_events
    schema: analytics
    partition:
      key: event_date
      strategy: recent_n
      recent_n: 7
    sampling:
      enabled: true
      method: stratified
      fraction: 0.1
```
### Example 4: Full Table (Default)

No special configuration needed:

```yaml
tables:
  - table: small_lookup_table
    schema: public
```
## 🚀 CLI Usage

### Preview Plan

See how partitioning and sampling will be applied:

```bash
baselinr plan --config config.yml
```

Output will show:

```
1. analytics.events
   Status: ready
   Partition: latest on event_date
   Sampling: stratified (1.00%), max 500,000 rows
```
### Run Profiling

```bash
baselinr profile --config config.yml
```

Results metadata will include:
- Which partition strategy was used
- Which partitions were profiled
- Sampling method and fraction applied
- Actual row count profiled
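As a purely hypothetical illustration of the shape of that metadata (the field names below are invented for the example and may not match Baselinr's actual output):

```python
# Hypothetical example only - field names are illustrative, not Baselinr's schema.
run_metadata = {
    "partition_strategy": "latest",
    "partitions_profiled": ["2024-01-15"],
    "sampling_method": "stratified",
    "sampling_fraction": 0.01,
    "rows_profiled": 100_000,
}
```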
## 📈 Performance Benefits

### Before (`sample_ratio`)

Profile 10% of a 1B-row table:

- Scans: 100M rows
- Time: ~10 minutes

### After (partition + sampling)

Profile the latest partition (10M rows) with a 1% sample:

- Scans: 100K rows (1000x reduction!)
- Time: ~6 seconds
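The numbers follow directly from the assumed sizes (a 1B-row table, 10M rows in the latest daily partition); a quick check:

```python
# Quick arithmetic behind the comparison above (assumed table sizes).
old_scan = int(1_000_000_000 * 0.10)   # sample_ratio 0.1 over the full table
new_scan = int(10_000_000 * 0.01)      # 1% of the latest 10M-row partition
print(old_scan, new_scan, old_scan // new_scan)  # -> 100000000 100000 1000
```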
### Example Speedup

For a table with daily partitions (365 partitions, 10M rows each):

**Full table with sampling:**

```yaml
sampling:
  enabled: true
  fraction: 0.001  # 0.1%
# Profiles: 3.65M rows (0.1% of 3.65B)
```

**Latest partition only:**

```yaml
partition:
  key: date
  strategy: latest
# Profiles: 10M rows (just today)
```

**Latest partition with sampling:**

```yaml
partition:
  key: date
  strategy: latest
sampling:
  enabled: true
  fraction: 0.01  # 1%
# Profiles: 100K rows (1% of 10M)
```
## 🔄 Migration Guide

### Old Configuration

```yaml
profiling:
  tables:
    - table: events
      schema: public
      sample_ratio: 0.1
  default_sample_ratio: 1.0
```
### New Configuration

Full table (no sampling):

```yaml
profiling:
  tables:
    - table: events
      schema: public
```

With sampling:

```yaml
profiling:
  tables:
    - table: events
      schema: public
      sampling:
        enabled: true
        method: random
        fraction: 0.1
```

With partition awareness:

```yaml
profiling:
  tables:
    - table: events
      schema: public
      partition:
        key: event_date
        strategy: latest
      sampling:
        enabled: true
        fraction: 0.1
```
## 🎓 Best Practices

### 1. Start with Partition Filtering

For time-series data, profile only recent partitions:

```yaml
partition:
  key: date
  strategy: recent_n
  recent_n: 7  # Last week
```

### 2. Add Sampling for Very Large Partitions

If even one partition is huge:

```yaml
partition:
  key: date
  strategy: latest
sampling:
  enabled: true
  fraction: 0.01
  max_rows: 1000000  # Safety cap
```

### 3. Use Stratified Sampling for Diverse Data

When data has important subgroups:

```yaml
sampling:
  enabled: true
  method: stratified  # Preserves distribution
  fraction: 0.05
```

### 4. Profile Small Tables Fully

No need for complexity on small tables:

```yaml
tables:
  - table: lookup_table  # Just specify the table
    schema: public
```
## 🔮 Future Enhancements

Planned features:

- **Cost-aware profiling** - Estimate query cost before profiling
- **Incremental profiling** - Profile only new partitions
- **Adaptive sampling** - Auto-adjust based on data characteristics
- **Multi-column partitioning** - Support compound partition keys
- **Custom partition strategies** - Plug in your own logic
## 📝 See Also

- `README.md` - Main documentation
- `examples/config_advanced.yml` - All features demonstrated
- `DEVELOPMENT.md` - Architecture details
- `baselinr/profiling/query_builder.py` - Implementation