Skip to main content

Baselinr Storage Schema Reference

Version: 1.0
Last Updated: 2024-11-16
Status: Production Ready

Overview

Baselinr stores profiling results, run metadata, drift events, and incremental state in five core tables. All tables use a consistent naming convention (baselinr_*) and are designed for multi-tenant, multi-warehouse deployments.

Schema Philosophy

  • Immutable History: Profiling results are never updated, only inserted
  • Composite Keys: Support multiple tables per profiling run
  • Flexible Metadata: TEXT/VARIANT columns for extensibility
  • Query Optimization: Indexes designed for common access patterns

Core Tables

1. baselinr_schema_version

Purpose: Tracks schema version for migration management and compatibility checking.

Schema:

ColumnTypeConstraintsDescription
versionINTEGERPRIMARY KEYSchema version number (sequential)
applied_atTIMESTAMPNOT NULLWhen migration was applied
descriptionVARCHAR(500)NULLHuman-readable description
migration_scriptVARCHAR(255)NULLMigration file name
checksumVARCHAR(64)NULLOptional integrity check

Indexes: None (small table, PK sufficient)

Use Cases:

  • Check current schema version on startup
  • Track migration history
  • Ensure code/database compatibility

Example Queries:

-- Get current schema version
SELECT version, description, applied_at
FROM baselinr_schema_version
ORDER BY version DESC
LIMIT 1;

-- View migration history
SELECT * FROM baselinr_schema_version
ORDER BY version;

2. baselinr_runs

Purpose: Tracks metadata for each profiling run, including execution time, row counts, and status.

Schema:

ColumnTypeConstraintsDescription
run_idVARCHAR(36)PRIMARY KEYUUID identifying the profiling run
dataset_nameVARCHAR(255)PRIMARY KEYTable name that was profiled
schema_nameVARCHAR(255)NULLSchema/database name
profiled_atTIMESTAMPNOT NULLWhen profiling executed
environmentVARCHAR(50)NULLEnvironment (dev/staging/prod)
statusVARCHAR(20)NULLRun status (completed/failed)
row_countINTEGERNULLNumber of rows profiled
column_countINTEGERNULLNumber of columns profiled

Composite Primary Key: (run_id, dataset_name)
Rationale: Allows multiple tables to be profiled in a single run while maintaining unique records per table.

Indexes:

  • idx_dataset_profiled on (dataset_name, profiled_at DESC) - Optimizes historical queries per table

Use Cases:

  • Query run history for a specific table
  • Track profiling frequency and status
  • Identify failed runs
  • Power dashboard overview
  • Filter by environment for multi-env deployments

Example Queries:

-- Get last 10 runs for a table
SELECT run_id, profiled_at, status, row_count
FROM baselinr_runs
WHERE dataset_name = 'customers'
ORDER BY profiled_at DESC
LIMIT 10;

-- Find failed runs in last 7 days
SELECT * FROM baselinr_runs
WHERE status = 'failed'
AND profiled_at > CURRENT_TIMESTAMP - INTERVAL '7 days';

-- Count runs by environment
SELECT environment, COUNT(*) as run_count
FROM baselinr_runs
WHERE profiled_at > CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY environment;

3. baselinr_results

Purpose: Stores individual column-level metrics for each profiling run.

Schema:

ColumnTypeConstraintsDescription
idINTEGERPRIMARY KEY, AUTO_INCREMENTUnique metric record ID
run_idVARCHAR(36)NOT NULLReferences baselinr_runs
dataset_nameVARCHAR(255)NOT NULLTable name
schema_nameVARCHAR(255)NULLSchema name
column_nameVARCHAR(255)NOT NULLColumn that was profiled
column_typeVARCHAR(100)NULLColumn data type
metric_nameVARCHAR(100)NOT NULLMetric identifier (e.g., "null_count")
metric_valueTEXTNULLMetric value as string
profiled_atTIMESTAMPNOT NULLWhen metric was captured

Foreign Key: (run_id, dataset_name) → baselinr_runs(run_id, dataset_name)

Indexes:

  • idx_run_id on (run_id) - Fast lookup by run
  • idx_dataset_column on (dataset_name, column_name) - Column history queries
  • idx_metric on (dataset_name, column_name, metric_name) - Specific metric trends

Metric Names (Standard):

  • null_count, null_percent - Null value statistics
  • distinct_count, distinct_percent - Cardinality
  • min_value, max_value - Range
  • mean, stddev - Statistical moments
  • histogram - Distribution (JSON string)
  • min_length, max_length, avg_length - String statistics

Use Cases:

  • Retrieve all metrics for a run
  • Track metric trends over time
  • Compare metrics across runs
  • Detect drift in specific metrics

Example Queries:

-- Get all metrics for a run
SELECT column_name, metric_name, metric_value
FROM baselinr_results
WHERE run_id = 'abc-123-def-456'
ORDER BY column_name, metric_name;

-- Track null_percent trend for a column
SELECT profiled_at, metric_value::FLOAT as null_percent
FROM baselinr_results
WHERE dataset_name = 'customers'
AND column_name = 'email'
AND metric_name = 'null_percent'
ORDER BY profiled_at DESC
LIMIT 30;

-- Find columns with high null rates
SELECT DISTINCT column_name, metric_value::FLOAT as null_percent
FROM baselinr_results
WHERE dataset_name = 'orders'
AND metric_name = 'null_percent'
AND metric_value::FLOAT > 10
ORDER BY null_percent DESC;

4. baselinr_events

Purpose: Stores drift detection alerts and other profiling events for historical tracking and analysis.

Schema:

ColumnTypeConstraintsDescription
event_idVARCHAR(36)PRIMARY KEYUnique event identifier (UUID)
event_typeVARCHAR(100)NOT NULLEvent type (e.g., "drift_detected")
table_nameVARCHAR(255)NULLAffected table
column_nameVARCHAR(255)NULLAffected column
metric_nameVARCHAR(100)NULLAffected metric
baseline_valueFLOATNULLPrevious/baseline value
current_valueFLOATNULLNew/current value
change_percentFLOATNULLPercentage change
drift_severityVARCHAR(20)NULLSeverity (low/medium/high)
timestampTIMESTAMPNOT NULLWhen event occurred
metadataTEXT/VARIANTNULLAdditional context (JSON)
created_atTIMESTAMPDEFAULT NOWRecord creation time

Indexes:

  • idx_event_type on (event_type) - Filter by event type
  • idx_table_name on (table_name) - Table-specific events
  • idx_timestamp on (timestamp DESC) - Recent events
  • idx_drift_severity on (drift_severity) - High-priority alerts

Event Types:

  • drift_detected - Metric drift alert
  • schema_change - Table schema modification
  • profiling_failure - Failed profiling attempt
  • data_quality_issue - Quality check failure

Use Cases:

  • Query recent drift events
  • Alert on high-severity issues
  • Audit trail for data changes
  • Drift detection history

Example Queries:

-- Recent high-severity drift events
SELECT table_name, column_name, metric_name,
change_percent, timestamp
FROM baselinr_events
WHERE drift_severity = 'high'
AND timestamp > CURRENT_TIMESTAMP - INTERVAL '7 days'
ORDER BY timestamp DESC;

-- Drift frequency by table
SELECT table_name, COUNT(*) as drift_count
FROM baselinr_events
WHERE event_type = 'drift_detected'
AND timestamp > CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY table_name
ORDER BY drift_count DESC;

-- All events for a specific column
SELECT event_type, drift_severity, change_percent, timestamp
FROM baselinr_events
WHERE table_name = 'orders'
AND column_name = 'total_amount'
ORDER BY timestamp DESC;

5. baselinr_table_state

Purpose: Tracks incremental profiling state per table for change detection and cost optimization.

Schema:

ColumnTypeConstraintsDescription
schema_nameVARCHAR(255)PRIMARY KEYSchema name
table_nameVARCHAR(255)PRIMARY KEYTable name
last_run_idVARCHAR(36)NULLMost recent run ID
snapshot_idVARCHAR(255)NULLChange tracking snapshot
change_tokenVARCHAR(255)NULLChange detection token
decisionVARCHAR(50)NULLProfiling decision (full/partial/skip)
decision_reasonVARCHAR(255)NULLWhy decision was made
last_profiled_atTIMESTAMPNULLLast profiling time
staleness_scoreINTEGERNULLHow stale the profile is
row_countBIGINTNULLLast known row count
bytes_scannedBIGINTNULLBytes scanned in last run
metadataTEXT/VARIANTNULLAdditional state (JSON)

Composite Primary Key: (schema_name, table_name)

Indexes: None (small table, accessed by PK)

Use Cases:

  • Incremental profiling decisions
  • Cost tracking and optimization
  • Detect unchanged tables (skip profiling)
  • Profile freshness monitoring

Example Queries:

-- Find stale tables (not profiled in 7+ days)
SELECT table_name, last_profiled_at,
CURRENT_TIMESTAMP - last_profiled_at as staleness
FROM baselinr_table_state
WHERE last_profiled_at < CURRENT_TIMESTAMP - INTERVAL '7 days'
ORDER BY staleness DESC;

-- Total bytes scanned by schema
SELECT schema_name, SUM(bytes_scanned) as total_bytes
FROM baselinr_table_state
GROUP BY schema_name;

-- Tables profiled today
SELECT table_name, decision, decision_reason
FROM baselinr_table_state
WHERE last_profiled_at > CURRENT_DATE;

Schema Versioning

Current Version: 1

Baselinr uses integer versioning starting at 1. The baselinr_schema_version table tracks all applied migrations.

Versioning Policy

Version Increments When:

  • Adding required columns (breaking)
  • Removing columns (breaking)
  • Renaming columns (breaking)
  • Changing column types (breaking)
  • Modifying primary keys (breaking)

Version DOES NOT Increment When:

  • Adding nullable/optional columns
  • Adding indexes
  • Adding new tables
  • Expanding column sizes
  • Adding comments

Compatibility

Baselinr supports N-1 compatibility (code can read one version behind):

  • Current Version: Code reads and writes v1
  • Supported Reading: Code can read v1 (no previous versions yet)
  • Migration Required: Warning shown if DB version ≠ code version

Version History

VersionDateDescriptionBreaking
12024-11-16Initial schema with all core tablesNo

Database-Specific Notes

PostgreSQL

  • Uses SERIAL for auto-increment
  • TIMESTAMP for datetime
  • TEXT for flexible metadata

Snowflake

  • Uses AUTOINCREMENT for auto-increment
  • TIMESTAMP_NTZ (no timezone) for datetime
  • VARIANT for structured metadata (JSON)

MySQL

  • Uses AUTO_INCREMENT for auto-increment
  • DATETIME for datetime
  • TEXT for flexible metadata

SQLite

  • Uses INTEGER PRIMARY KEY for auto-increment
  • TEXT for datetime (ISO8601 strings)
  • TEXT for flexible metadata

Query Performance Optimization

Index Usage Guidelines

  1. Time-based queries: Always use indexed timestamp columns
  2. Table lookups: Use dataset_name in WHERE clause
  3. Column history: Include both dataset_name and column_name
  4. Pagination: Use LIMIT/OFFSET with ORDER BY

Query Best Practices

-- ✅ GOOD: Uses index
SELECT * FROM baselinr_runs
WHERE dataset_name = 'customers'
ORDER BY profiled_at DESC
LIMIT 10;

-- ❌ BAD: No index on status alone
SELECT * FROM baselinr_runs
WHERE status = 'completed';

-- ✅ BETTER: Combine with indexed column
SELECT * FROM baselinr_runs
WHERE dataset_name = 'customers'
AND status = 'completed'
ORDER BY profiled_at DESC;

Migration Management

See MIGRATION_GUIDE.md for:

  • Creating new migrations
  • Testing migrations
  • Rolling back changes
  • Version upgrade procedures

Security Considerations

  1. Access Control: Restrict write access to Baselinr service account only
  2. Read Access: Grant read-only access to dashboard/query users
  3. Sensitive Data: Metric values stored as strings—avoid storing PII
  4. Audit Trail: All events logged in baselinr_events

Maintenance

  1. Retention Policy: Archive runs older than 90 days
  2. Index Maintenance: Rebuild indexes monthly
  3. Vacuum/Analyze: Run weekly (PostgreSQL)
  4. Monitor Growth: Track table sizes

Cleanup Queries

-- Archive old runs (example)
DELETE FROM baselinr_results
WHERE profiled_at < CURRENT_TIMESTAMP - INTERVAL '90 days';

DELETE FROM baselinr_runs
WHERE profiled_at < CURRENT_TIMESTAMP - INTERVAL '90 days';

-- Archive old events
DELETE FROM baselinr_events
WHERE timestamp < CURRENT_TIMESTAMP - INTERVAL '365 days';

Additional Resources


Maintained by: Baselinr Team
Questions? Open an issue on GitHub