Dataset Configuration Migration Guide
This guide explains how to migrate from inline dataset configuration to the new file-based directory structure.
Overview
Baselinr now supports organizing dataset configurations in a modular file-based structure, similar to dbt projects. This provides better organization, version control, and scalability compared to inline configuration.
When to Migrate
You should migrate to the file-based structure if:
- You have multiple dataset configurations
- You want better organization and maintainability
- You need to collaborate on dataset configs with a team
- You want to scale to many datasets
Migration Methods
Automated Migration (Recommended)
Use the migrate-config CLI command to automatically convert inline configs to files:
baselinr migrate-config --config config.yml --output-dir datasets
Options:
- `--config`, `-c`: Path to your configuration file (required)
- `--output-dir`: Directory name for dataset files (default: `datasets`)
- `--no-backup`: Skip creating a backup of the original config
Example:
# Migrate with default settings (creates backup)
baselinr migrate-config --config config.yml
# Migrate to custom directory
baselinr migrate-config --config config.yml --output-dir my_datasets
# Migrate without backup
baselinr migrate-config --config config.yml --no-backup
What the migration tool does:
- Creates a backup of your original config file (unless `--no-backup` is used)
- Extracts all inline dataset configs to individual YAML files
- Migrates `profiling.schemas[]` entries to `{schema}_schema.yml` files
- Migrates `profiling.databases[]` entries to `{database}_database.yml` files
- Updates your main config file to reference the datasets directory
- Removes `profiling.schemas` and `profiling.databases` from the config
- Preserves all configuration settings
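For illustration, running the tool against the example config shown later in this guide would leave a layout roughly like this (filenames follow your table names, and the backup timestamp will differ):

```
config.yml                               # updated to point at ./datasets
config.yml.backup.20240101_120000.yml    # backup of the original config
datasets/
├── customers.yml
└── orders.yml
```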
Manual Migration
If you prefer to migrate manually:
1. Create a datasets directory:

   `mkdir datasets`

2. Extract dataset configs: For each dataset in your `datasets.datasets` section, create a file:
   - `datasets/{table_name}.yml` for table-specific configs
   - `datasets/{schema_name}_schema.yml` for schema-level configs
   - `datasets/{database_name}_database.yml` for database-level configs

3. Migrate profiling.schemas and profiling.databases: If you have `profiling.schemas[]` or `profiling.databases[]` entries, migrate them to dataset files:
   - Convert each entry in `profiling.schemas[]` to `datasets/{schema}_schema.yml`
   - Convert each entry in `profiling.databases[]` to `datasets/{database}_database.yml`
   - Remove `profiling.schemas` and `profiling.databases` from your config

4. Update main config: Replace the inline `datasets` section with:

       datasets:
         datasets_dir: ./datasets
         auto_discover: true
         recursive: true
Migration Examples
Before (Inline Configuration)
# config.yml
environment: development
source:
type: postgres
host: localhost
database: mydb
username: user
password: pass
storage:
connection:
type: postgres
host: localhost
database: mydb
username: user
password: pass
results_table: results
runs_table: runs
datasets:
datasets:
- table: customers
schema: public
profiling:
partition:
key: created_at
strategy: latest
sampling:
enabled: true
fraction: 0.1
drift:
strategy: statistical
- table: orders
schema: public
profiling:
partition:
key: order_date
strategy: latest
After (File-Based Configuration)
config.yml:
environment: development
source:
type: postgres
host: localhost
database: mydb
username: user
password: pass
storage:
connection:
type: postgres
host: localhost
database: mydb
username: user
password: pass
results_table: results
runs_table: runs
datasets:
datasets_dir: ./datasets
auto_discover: true
recursive: true
datasets/customers.yml:
table: customers
schema: public
profiling:
partition:
key: created_at
strategy: latest
sampling:
enabled: true
fraction: 0.1
drift:
strategy: statistical
datasets/orders.yml:
table: orders
schema: public
profiling:
partition:
key: order_date
strategy: latest
Migrating profiling.schemas and profiling.databases
If you have schema-level or database-level configurations in profiling.schemas[] or profiling.databases[], the migration tool will automatically convert them:
Before:
profiling:
schemas:
- schema: analytics
partition:
strategy: latest
key: date
columns:
- name: "*_id"
drift:
enabled: false
databases:
- database: warehouse
sampling:
enabled: true
fraction: 0.05
After:
profiling:
tables:
- select_schema: true
schema: analytics
- select_all_schemas: true
database: warehouse
datasets:
datasets_dir: ./datasets
auto_discover: true
recursive: true
datasets/analytics_schema.yml:
schema: analytics
profiling:
partition:
strategy: latest
key: date
columns:
- name: "*_id"
drift:
enabled: false
datasets/warehouse_database.yml:
database: warehouse
profiling:
sampling:
enabled: true
fraction: 0.05
File Naming Conventions
The migration tool automatically generates filenames based on the dataset identifier:
- Table-level configs: `{table_name}.yml`
  - Examples: `customers.yml`, `orders.yml`
- Schema-level configs: `{schema_name}_schema.yml`
  - Examples: `public_schema.yml`, `analytics_schema.yml`
- Database-level configs: `{database_name}_database.yml`
  - Example: `warehouse_database.yml`
Directory Structure
You can organize dataset files in a hierarchical structure:
config.yml
datasets/
├── customers.yml
├── orders.yml
├── analytics/
│ ├── _schema.yml # Schema-level config for analytics
│ ├── events.yml
│ └── metrics.yml
└── warehouse/
├── _database.yml # Database-level config
└── staging/
└── _schema.yml
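As a sketch of what a schema-level file inside a subdirectory might contain, here is `datasets/analytics/_schema.yml` carrying the same settings as the flat `analytics_schema.yml` example above. The explicit `schema:` key mirrors that example; check whether your Baselinr version can instead infer the schema from the directory name:

```yaml
# datasets/analytics/_schema.yml
# Explicit schema key, mirroring the flat-file example above
schema: analytics
profiling:
  partition:
    strategy: latest
    key: date
```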
Verification
After migration, verify your configuration:
1. Check that files were created:

   `ls -la datasets/`

2. Validate the configuration:

   `baselinr plan --config config.yml`

3. Test loading:

       from baselinr.config.loader import ConfigLoader

       config = ConfigLoader.load_from_file("config.yml")
       print(f"Loaded {len(config.datasets.datasets)} datasets")
Troubleshooting
Migration fails with "No inline dataset configs found"
This means your config doesn't have inline dataset configs. You may already be using the file-based structure, or you need to add dataset configs first.
Files not being discovered
Check:
- The `datasets_dir` path is correct (relative to the config file, or absolute)
- Files match the `file_pattern` (default: `*.yml`)
- `recursive` is set to `true` if using subdirectories
- Files aren't excluded by `exclude_patterns`
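If files are still not discovered, it can help to spell these settings out in the main config. A minimal sketch; the `file_pattern` and `exclude_patterns` values shown here are illustrative assumptions, not required settings:

```yaml
datasets:
  datasets_dir: ./datasets    # relative to the config file, or an absolute path
  auto_discover: true
  recursive: true             # needed when dataset files live in subdirectories
  file_pattern: "*.yml"       # assumed default; dataset files must match this pattern
  exclude_patterns: []        # make sure no pattern here matches your dataset files
```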
Configuration errors after migration
- Check that all dataset files are valid YAML
- Verify dataset identifiers (table, schema, database) are correct
- Ensure the main config file references the datasets directory correctly
Backup file location
Backup files are created in the same directory as your config file with the format:
`{config_file}.backup.{timestamp}.{ext}`
Example: `config.yml.backup.20240101_120000.yml`
Best Practices
- Use descriptive filenames: Match table/schema names for clarity
- Organize by schema/database: Use subdirectories for logical grouping
- Version control: Commit dataset files to version control
- Documentation: Add comments in dataset files for complex configurations (see the example below)
- Backup before migration: Always create backups (default behavior)
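For example, the `customers.yml` file from earlier can be annotated so the intent of each override is obvious; the comments below are illustrative, not required:

```yaml
# datasets/customers.yml
table: customers
schema: public
profiling:
  partition:
    key: created_at    # profile the most recent partition by creation date
    strategy: latest
  sampling:
    enabled: true
    fraction: 0.1      # a 10% sample keeps profiling fast on a large table
```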
Next Steps
After migration:
- Review generated files and organize as needed
- Update your team's workflow to edit dataset files directly
- Consider using schema-level or database-level configs for common settings
- Set up CI/CD to validate dataset configurations
Final Notes
After Migration
Once you've migrated to the file-based structure:
1. Remove old config sections: The migration tool automatically removes migrated sections, but verify:
   - `profiling.schemas[]` should be removed (migrated to `{schema}_schema.yml` files)
   - `profiling.databases[]` should be removed (migrated to `{database}_database.yml` files)
   - `validation.rules[]` should be removed (migrated to dataset files)
   - Inline `datasets.datasets[]` should be removed (migrated to individual files)

2. Verify configuration: Always validate after migration:

   `baselinr validate-config --config config.yml`

3. Test profiling: Run a test profile to ensure everything works:

   `baselinr profile --dry-run`

4. Update workflows:
   - Update CI/CD pipelines to validate dataset files (a minimal sketch follows this list)
   - Update team documentation to reference new file locations
   - Consider using the dashboard UI for managing dataset configs
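A minimal CI sketch, assuming GitHub Actions and that Baselinr is installed with pip; adapt the installation step and trigger paths to your own setup:

```yaml
# .github/workflows/validate-datasets.yml (hypothetical workflow name)
name: Validate dataset configs
on:
  pull_request:
    paths:
      - "config.yml"
      - "datasets/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install baselinr   # assumes Baselinr is pip-installable in your environment
      - run: baselinr validate-config --config config.yml
```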
Important Reminders
- `profiling.tables[]` is still valid: This section is for table selection patterns, not dataset-specific configs. It remains in the main config file.
- Global defaults stay in `config.yml`: Only dataset-specific overrides go in the `datasets/` directory.
- File naming matters: Use the exact naming conventions (`{table}.yml`, `{schema}_schema.yml`, `{database}_database.yml`).
- Precedence is important: Table-level configs override schema-level, which override database-level, which override global defaults (see the sketch below).
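To make the precedence rule concrete, here is a sketch reusing the database-level example from above plus a hypothetical table file; for the sketch, assume an `orders` table that lives in the `warehouse` database (the two YAML documents below represent separate files):

```yaml
# datasets/warehouse_database.yml: database-level default
database: warehouse
profiling:
  sampling:
    enabled: true
    fraction: 0.05    # default for every table in the warehouse database...
---
# datasets/orders.yml: table-level override (hypothetical table in warehouse)
table: orders
profiling:
  sampling:
    enabled: true
    fraction: 0.1     # ...overridden for this one table
```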
Related Documentation
- Configuration Reference - Complete config reference
- Dataset Configuration Guide - Detailed dataset config guide
- Best Practices Guide - Configuration best practices