Dataset Configuration Migration Guide

This guide explains how to migrate from inline dataset configuration to the new file-based directory structure.

Overview

Baselinr now supports organizing dataset configurations in a modular file-based structure, similar to dbt projects. This provides better organization, version control, and scalability compared to inline configuration.

When to Migrate

You should migrate to the file-based structure if:

  • You have multiple dataset configurations
  • You want better organization and maintainability
  • You need to collaborate on dataset configs with a team
  • You want to scale to many datasets

Migration Methods

Automated Migration (CLI)

Use the migrate-config CLI command to automatically convert inline configs to files:

baselinr migrate-config --config config.yml --output-dir datasets

Options:

  • --config, -c: Path to your configuration file (required)
  • --output-dir: Directory name for dataset files (default: datasets)
  • --no-backup: Skip creating a backup of the original config

Example:

# Migrate with default settings (creates backup)
baselinr migrate-config --config config.yml

# Migrate to custom directory
baselinr migrate-config --config config.yml --output-dir my_datasets

# Migrate without backup
baselinr migrate-config --config config.yml --no-backup

What the migration tool does:

  1. Creates a backup of your original config file (unless --no-backup is used)
  2. Extracts all inline dataset configs to individual YAML files
  3. Migrates profiling.schemas[] entries to {schema}_schema.yml files
  4. Migrates profiling.databases[] entries to {database}_database.yml files
  5. Updates your main config file to reference the datasets directory
  6. Removes profiling.schemas and profiling.databases from the config
  7. Preserves all configuration settings

Manual Migration

If you prefer to migrate manually:

  1. Create datasets directory:

    mkdir datasets
  2. Extract dataset configs: For each dataset in your datasets.datasets section, create a file:

    • datasets/{table_name}.yml for table-specific configs
    • datasets/{schema_name}_schema.yml for schema-level configs
    • datasets/{database_name}_database.yml for database-level configs
  3. Migrate profiling.schemas and profiling.databases: If you have profiling.schemas[] or profiling.databases[] entries, migrate them to dataset files:

    • Convert each entry in profiling.schemas[] to datasets/{schema}_schema.yml
    • Convert each entry in profiling.databases[] to datasets/{database}_database.yml
    • Remove profiling.schemas and profiling.databases from your config
  4. Update main config: Replace the inline datasets section with:

    datasets:
      datasets_dir: ./datasets
      auto_discover: true
      recursive: true

Migration Examples

Before (Inline Configuration)

# config.yml
environment: development
source:
  type: postgres
  host: localhost
  database: mydb
  username: user
  password: pass

storage:
  connection:
    type: postgres
    host: localhost
    database: mydb
    username: user
    password: pass
  results_table: results
  runs_table: runs

datasets:
  datasets:
    - table: customers
      schema: public
      profiling:
        partition:
          key: created_at
          strategy: latest
        sampling:
          enabled: true
          fraction: 0.1
      drift:
        strategy: statistical
    - table: orders
      schema: public
      profiling:
        partition:
          key: order_date
          strategy: latest
After (File-Based Configuration)

config.yml:

environment: development
source:
  type: postgres
  host: localhost
  database: mydb
  username: user
  password: pass

storage:
  connection:
    type: postgres
    host: localhost
    database: mydb
    username: user
    password: pass
  results_table: results
  runs_table: runs

datasets:
  datasets_dir: ./datasets
  auto_discover: true
  recursive: true

datasets/customers.yml:

table: customers
schema: public
profiling:
  partition:
    key: created_at
    strategy: latest
  sampling:
    enabled: true
    fraction: 0.1
drift:
  strategy: statistical

datasets/orders.yml:

table: orders
schema: public
profiling:
  partition:
    key: order_date
    strategy: latest

Migrating profiling.schemas and profiling.databases

If you have schema-level or database-level configurations in profiling.schemas[] or profiling.databases[], the migration tool will automatically convert them:

Before:

profiling:
  schemas:
    - schema: analytics
      partition:
        strategy: latest
        key: date
      columns:
        - name: "*_id"
      drift:
        enabled: false
  databases:
    - database: warehouse
      sampling:
        enabled: true
        fraction: 0.05

After:

profiling:
  tables:
    - select_schema: true
      schema: analytics
    - select_all_schemas: true
      database: warehouse

datasets:
  datasets_dir: ./datasets
  auto_discover: true
  recursive: true

datasets/analytics_schema.yml:

schema: analytics
profiling:
  partition:
    strategy: latest
    key: date
  columns:
    - name: "*_id"
drift:
  enabled: false

datasets/warehouse_database.yml:

database: warehouse
profiling:
  sampling:
    enabled: true
    fraction: 0.05

File Naming Conventions

The migration tool automatically generates filenames based on the dataset identifier:

  • Table-level configs: {table_name}.yml

    • Example: customers.yml, orders.yml
  • Schema-level configs: {schema_name}_schema.yml

    • Example: public_schema.yml, analytics_schema.yml
  • Database-level configs: {database_name}_database.yml

    • Example: warehouse_database.yml

Directory Structure

You can organize dataset files in a hierarchical structure:

config.yml
datasets/
├── customers.yml
├── orders.yml
├── analytics/
│   ├── _schema.yml       # Schema-level config for analytics
│   ├── events.yml
│   └── metrics.yml
└── warehouse/
    ├── _database.yml     # Database-level config
    └── staging/
        └── _schema.yml

Verification

After migration, verify your configuration:

  1. Check that files were created:

    ls -la datasets/
  2. Validate the configuration:

    baselinr plan --config config.yml
  3. Test loading:

    from baselinr.config.loader import ConfigLoader

    config = ConfigLoader.load_from_file("config.yml")
    print(f"Loaded {len(config.datasets.datasets)} datasets")

Troubleshooting

Migration fails with "No inline dataset configs found"

This means your config doesn't have inline dataset configs. You may already be using the file-based structure, or you need to add dataset configs first.

Files not being discovered

Check:

  • The datasets_dir path is correct (relative to config file or absolute)
  • Files match the file_pattern (default: *.yml)
  • recursive is set to true if using subdirectories
  • Files aren't excluded by exclude_patterns
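
If discovery still fails, it can help to spell these settings out explicitly instead of relying on defaults. The sketch below uses the option names mentioned in this guide (datasets_dir, auto_discover, recursive, file_pattern, exclude_patterns); the pattern values are illustrative, and it assumes exclude_patterns takes a list of glob patterns:

datasets:
  datasets_dir: ./datasets      # resolved relative to config.yml unless absolute
  auto_discover: true
  recursive: true               # required when dataset files live in subdirectories
  file_pattern: "*.yml"         # default per this guide; .yaml files would not match
  exclude_patterns:
    - "*_draft.yml"             # illustrative exclusion; remove if it hides expected files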

Configuration errors after migration

  1. Check that all dataset files are valid YAML
  2. Verify dataset identifiers (table, schema, database) are correct
  3. Ensure the main config file references the datasets directory correctly

Backup file location

Backup files are created in the same directory as your config file with the format:

{config_file}.backup.{timestamp}.{ext}

Example: config.yml.backup.20240101_120000.yml

Best Practices

  1. Use descriptive filenames: Match table/schema names for clarity
  2. Organize by schema/database: Use subdirectories for logical grouping
  3. Version control: Commit dataset files to version control
  4. Documentation: Add comments in dataset files for complex configurations (see the example after this list)
  5. Backup before migration: Always create backups (default behavior)
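
As a small illustration of practices 1 and 4, a dataset file whose non-obvious choices carry a one-line comment is much easier to review later. This sketch reuses the orders example from this guide; the comments and values are illustrative, not requirements:

# datasets/orders.yml — annotated dataset config (illustrative)
table: orders
schema: public
profiling:
  partition:
    key: order_date       # orders arrive daily; profile only the newest partition
    strategy: latest
  sampling:
    enabled: true
    fraction: 0.1         # 10% sample keeps profiling runs fast on this large table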

Next Steps

After migration:

  • Review generated files and organize as needed
  • Update your team's workflow to edit dataset files directly
  • Consider using schema-level or database-level configs for common settings
  • Set up CI/CD to validate dataset configurations
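
For the CI/CD point above, a minimal pipeline only needs to run the validation command from this guide whenever config.yml or the datasets/ directory changes. The sketch below assumes GitHub Actions and a pip-installable baselinr package; adjust the install step and trigger paths to your setup:

# .github/workflows/validate-datasets.yml (illustrative sketch)
name: Validate dataset configs
on:
  pull_request:
    paths:
      - "config.yml"
      - "datasets/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install baselinr          # assumption: adjust to how you install baselinr
      - run: baselinr validate-config --config config.yml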

Final Notes

After Migration

Once you've migrated to the file-based structure:

  1. Remove old config sections: The migration tool automatically removes migrated sections, but verify:

    • profiling.schemas[] should be removed (migrated to {schema}_schema.yml files)
    • profiling.databases[] should be removed (migrated to {database}_database.yml files)
    • validation.rules[] should be removed (migrated to dataset files)
    • Inline datasets.datasets[] should be removed (migrated to individual files)
  2. Verify configuration: Always validate after migration:

    baselinr validate-config --config config.yml
  3. Test profiling: Run a test profile to ensure everything works:

    baselinr profile --dry-run
  4. Update workflows:

    • Update CI/CD pipelines to validate dataset files
    • Update team documentation to reference new file locations
    • Consider using the dashboard UI for managing dataset configs

Important Reminders

  • profiling.tables[] is still valid: This section is for table selection patterns, not dataset-specific configs. It remains in the main config file.
  • Global defaults stay in config.yml: Only dataset-specific overrides go in the datasets/ directory.
  • File naming matters: Use exact naming conventions ({table}.yml, {schema}_schema.yml, {database}_database.yml).
  • Precedence is important: Table-level configs override schema-level configs, which override database-level configs, which override global defaults, as sketched below.
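
A quick sketch of that precedence, using the file naming conventions above. The table names and sampling values are illustrative, and the sketch assumes the most specific file's profiling.sampling block wins outright:

# datasets/warehouse_database.yml — database-wide default (lowest layer)
database: warehouse
profiling:
  sampling:
    enabled: true
    fraction: 0.05

# datasets/analytics_schema.yml — overrides the database default for one schema
schema: analytics
profiling:
  sampling:
    enabled: true
    fraction: 0.2

# datasets/events.yml — table-level setting wins for this table
table: events
schema: analytics
profiling:
  sampling:
    enabled: false

With these three files, events would be profiled without sampling, other tables in the analytics schema would sample 20%, and remaining tables in the warehouse database would sample 5%; global defaults in config.yml apply only where no dataset file matches.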