Baselinr Project Overview

📁 Complete Project Structure

profile_mesh/
│
├── baselinr/                           # Main Python package
│   ├── __init__.py
│   ├── cli.py                          # Command-line interface
│   │
│   ├── config/                         # Configuration management
│   │   ├── __init__.py
│   │   ├── schema.py                   # Pydantic models
│   │   └── loader.py                   # YAML/JSON config loader
│   │
│   ├── connectors/                     # Database connectors
│   │   ├── __init__.py
│   │   ├── base.py                     # Abstract base connector
│   │   ├── postgres.py                 # PostgreSQL implementation
│   │   ├── snowflake.py                # Snowflake implementation
│   │   └── sqlite.py                   # SQLite implementation
│   │
│   ├── profiling/                      # Profiling engine
│   │   ├── __init__.py
│   │   ├── core.py                     # Main profiling orchestrator
│   │   └── metrics.py                  # Column-level metric calculator
│   │
│   ├── storage/                        # Results storage
│   │   ├── __init__.py
│   │   ├── writer.py                   # Results writer
│   │   └── schema.sql                  # Storage schema DDL
│   │
│   ├── drift/                          # Drift detection
│   │   ├── __init__.py
│   │   └── detector.py                 # Drift detector and reporter
│   │
│   └── integrations/
│       └── dagster/                    # Dagster orchestration
│           ├── __init__.py
│           ├── assets.py               # Asset factory
│           ├── sensors.py              # Plan-aware sensor
│           └── events.py               # Event emission
│
├── examples/                           # Example configurations
│   ├── config.yml                      # PostgreSQL config
│   ├── config_sqlite.yml               # SQLite config
│   ├── dagster_repository.py           # Dagster definitions
│   └── quickstart.py                   # Quickstart script
│
├── docker/                             # Docker development environment
│   ├── docker-compose.yml              # Compose configuration
│   ├── Dockerfile                      # Application container
│   ├── init_postgres.sql               # Database initialization
│   ├── dagster.yaml                    # Dagster instance config
│   └── workspace.yaml                  # Dagster workspace config
│
├── tests/                              # Test suite
│   ├── __init__.py
│   ├── test_config.py                  # Configuration tests
│   └── test_profiling.py               # Profiling tests
│
├── setup.py                            # Package setup (setuptools)
├── pyproject.toml                      # Modern Python packaging config
├── requirements.txt                    # Python dependencies
├── Makefile                            # Development automation
├── .gitignore                          # Git ignore patterns
├── .dockerignore                       # Docker ignore patterns
├── LICENSE                             # Apache License 2.0
├── README.md                           # Main documentation
├── docs/getting-started/QUICKSTART.md  # Quick start guide
├── DEVELOPMENT.md                      # Developer guide
├── PROJECT_OVERVIEW.md                 # This file
└── MANIFEST.in                         # Package manifest

🎯 Key Features Implemented

✅ Phase 1 MVP Complete

All Phase 1 requirements from the specification have been implemented:

1. Profiling Engine ✓

  • ✅ Profiles tables via SQLAlchemy
  • ✅ Collects schema metadata
  • ✅ Computes column metrics:
    • count, null %, distinct %
    • min, max, mean, stddev
    • histograms
    • string length statistics
  • ✅ Supports sampling
  • ✅ Outputs structured results (JSON + SQL)

2. Configuration System ✓

  • ✅ YAML/JSON configuration loader
  • ✅ Pydantic validation
  • ✅ Warehouse connection configuration
  • ✅ Table patterns (explicit or wildcard-ready)
  • ✅ Sampling configuration
  • ✅ Output destination configuration
  • ✅ Environment overrides via env vars
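
The load, override, validate flow described above might look roughly like the sketch below. To stay runnable with only the standard library, it parses JSON rather than YAML and uses a dataclass as a stand-in for a Pydantic model. The `source` key, the field names, and the `BASELINR_SOURCE_` prefix are all hypothetical and may not match Baselinr's real schema or environment variable names.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Hypothetical connection settings; a stand-in for a Pydantic model."""

    type: str
    database: str
    host: str = "localhost"
    port: int = 5432

    def __post_init__(self):
        # Minimal validation, in the spirit of what Pydantic would enforce.
        if self.type not in {"postgres", "snowflake", "sqlite"}:
            raise ValueError(f"unsupported source type: {self.type}")
        self.port = int(self.port)  # env overrides arrive as strings


def load_config(text, env=None, prefix="BASELINR_SOURCE_"):
    """Parse JSON config text, then let environment variables override fields."""
    env = os.environ if env is None else env
    raw = json.loads(text)["source"]
    for key in ("type", "database", "host", "port"):
        override = env.get(prefix + key.upper())
        if override is not None:
            raw[key] = override
    return SourceConfig(**raw)


cfg = load_config(
    '{"source": {"type": "postgres", "database": "baselinr"}}',
    env={"BASELINR_SOURCE_PORT": "5433"},  # assumed variable name
)
print(cfg)
```

Applying overrides before validation means a bad value from the environment is rejected by the same checks as a bad value in the file.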

3. Storage Layer ✓

  • ✅ Results table with history
  • ✅ Schema includes:
    • dataset_name, column_name
    • metric_name, metric_value
    • profiled_at, run_id
  • ✅ Runs table for metadata
  • ✅ Automatic table creation
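
One way to picture this storage layout is the sketch below, which creates both tables on first use and appends one row per metric so that earlier runs stay queryable. The results columns follow the list above; the table names, the column types, the layout of the runs table, and the `write_run` helper are assumptions for illustration, and the real DDL lives in storage/schema.sql.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Assumed DDL: only the results column names come from the overview.
DDL = """
CREATE TABLE IF NOT EXISTS runs (
    run_id       TEXT PRIMARY KEY,
    dataset_name TEXT NOT NULL,
    profiled_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS results (
    run_id       TEXT NOT NULL REFERENCES runs(run_id),
    dataset_name TEXT NOT NULL,
    column_name  TEXT NOT NULL,
    metric_name  TEXT NOT NULL,
    metric_value TEXT,
    profiled_at  TEXT NOT NULL
);
"""


def write_run(conn, dataset_name, column_metrics):
    """Store one profiling run as one row per (column, metric) pair."""
    conn.executescript(DDL)  # automatic, idempotent table creation
    run_id = str(uuid.uuid4())
    profiled_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?)", (run_id, dataset_name, profiled_at)
    )
    rows = [
        (run_id, dataset_name, column, metric, str(value), profiled_at)
        for column, metrics in column_metrics.items()
        for metric, value in metrics.items()
    ]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return run_id


conn = sqlite3.connect(":memory:")
first = write_run(conn, "customers", {"age": {"count": 4, "null_percent": 25.0}})
second = write_run(conn, "customers", {"age": {"count": 5, "null_percent": 20.0}})

# Both runs are kept, so the history of a metric can be queried later.
history = conn.execute(
    "SELECT COUNT(*) FROM results WHERE metric_name = 'count'"
).fetchone()[0]
print(history)
```

The long, narrow shape (one row per metric) lets new metrics be added without a schema migration, at the cost of storing values as text.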

4. Execution Layer ✓

  • ✅ CLI command: baselinr profile --config config.yml
  • ✅ Dagster integration:
    • Dynamic asset factory
    • Configurable jobs
    • Event emission
    • Schedule definitions

5. Developer Environment ✓

  • ✅ Docker Compose setup with:
    • PostgreSQL (sample data + results)
    • Dagster daemon
    • Dagster web UI
  • ✅ Sample data generator (SQL seed script)
  • ✅ No-cost local setup
  • ✅ Sample tables: customers, products, orders

6. Drift Detection ✓

  • ✅ Compare two profile runs
  • ✅ Detect schema changes
  • ✅ Calculate metric differences
  • ✅ Severity classification (low/medium/high)
  • ✅ JSON output
  • ✅ Summary statistics
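
The comparison described above can be sketched as follows: two runs are represented as nested dictionaries, added or removed columns are reported as schema changes, and each metric's percent change is bucketed into a severity. The thresholds (5 %, 15 %, 30 %), the function names, and the report layout are hypothetical choices for this example and are not taken from Baselinr's detector.

```python
def classify(change_percent, thresholds=(5.0, 15.0, 30.0)):
    """Map a percent change to low/medium/high, or None below the low cutoff."""
    low, medium, high = thresholds  # assumed cutoffs, not Baselinr's defaults
    magnitude = abs(change_percent)
    if magnitude >= high:
        return "high"
    if magnitude >= medium:
        return "medium"
    if magnitude >= low:
        return "low"
    return None


def detect_drift(baseline, current):
    """Compare two runs shaped as {column: {metric: value}}."""
    report = {"schema_changes": [], "metric_drifts": []}
    for column in sorted(set(baseline) | set(current)):
        if column not in current:
            report["schema_changes"].append(f"column removed: {column}")
            continue
        if column not in baseline:
            report["schema_changes"].append(f"column added: {column}")
            continue
        for metric, old in baseline[column].items():
            new = current[column].get(metric)
            if new is None or old in (None, 0):
                continue  # percent change is undefined here
            change = 100.0 * (new - old) / old
            severity = classify(change)
            if severity:
                report["metric_drifts"].append(
                    {
                        "column": column,
                        "metric": metric,
                        "change_percent": change,
                        "severity": severity,
                    }
                )
    return report


baseline = {"age": {"mean": 40.0, "null_percent": 10.0}, "email": {"count": 100}}
current = {"age": {"mean": 44.0, "null_percent": 14.0}, "signup_date": {"count": 100}}
report = detect_drift(baseline, current)
print(report)
```

A relative change is easy to reason about but breaks down when the baseline value is zero, which is why this sketch skips those metrics instead of guessing.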

📊 Supported Databases

| Database   | Status  | Notes                       |
|------------|---------|-----------------------------|
| PostgreSQL | ✅ Full | Primary development target  |
| SQLite     | ✅ Full | Lightweight local testing   |
| Snowflake  | ✅ Full | Enterprise data warehouse   |
| MySQL      | 🔲 Easy | Can be added with connector |
| BigQuery   | 🔲 Easy | Can be added with connector |
| Redshift   | 🔲 Easy | Can be added with connector |

🔧 Available Commands

CLI Commands

# Profile tables
baselinr profile --config config.yml [--output results.json] [--dry-run]

# Detect drift
baselinr drift --config config.yml --dataset <name> \
[--baseline <run-id>] [--current <run-id>] \
[--output report.json] [--fail-on-drift]

Makefile Commands

make help         # Show all commands
make install      # Install Baselinr
make docker-up    # Start Docker environment
make docker-down  # Stop Docker environment
make quickstart   # Run quickstart example
make test         # Run tests
make format       # Format code
make lint         # Run linters

Python API

from baselinr.config.loader import ConfigLoader
from baselinr.profiling.core import ProfileEngine
from baselinr.storage.writer import ResultWriter
from baselinr.drift.detector import DriftDetector

# Load config
config = ConfigLoader.load_from_file("config.yml")

# Profile tables
engine = ProfileEngine(config)
results = engine.profile()

# Write results
writer = ResultWriter(config.storage)
writer.write_results(results)

# Detect drift
detector = DriftDetector(config.storage)
report = detector.detect_drift(dataset_name="customers")

🚀 Getting Started

Choose your path:

1. Quick Test (5 minutes)

cd profile_mesh
make docker-up
pip install -e ".[dagster]"
make quickstart

2. Full Setup (10 minutes)

cd profile_mesh
make install-all
make docker-up
# Wait about 30 seconds for the containers to start
baselinr profile --config examples/config.yml

3. Your Database

  • Copy examples/config.yml
  • Update connection details
  • Add your tables
  • Run: baselinr profile --config your_config.yml

📚 Documentation Files

| File                               | Purpose                                 |
|------------------------------------|-----------------------------------------|
| README.md                          | Main documentation and feature overview |
| docs/getting-started/QUICKSTART.md | Step-by-step getting started guide      |
| DEVELOPMENT.md                     | Architecture and contribution guide     |
| PROJECT_OVERVIEW.md                | This file - project structure           |

🧪 Testing

# Run all tests
make test

# Run specific test file
pytest tests/test_config.py -v

# Run with coverage
pytest --cov=baselinr tests/

๐Ÿณ Docker Environmentโ€‹

The Docker environment includes:

  • PostgreSQL (port 5432)
    • Database: baselinr
    • User: baselinr
    • Password: baselinr
    • Sample tables pre-loaded
  • Dagster UI (port 3000)
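
For reference, the PostgreSQL settings above can be combined into a connection URL of the kind SQLAlchemy accepts. This is a small sketch: the host `localhost` assumes the container's port is published to the host machine, and the helper name `postgres_url` is made up for this example.

```python
from urllib.parse import quote_plus, urlsplit


def postgres_url(user, password, host, port, database):
    """Build a postgresql:// URL, escaping special characters in credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Credentials from the Docker environment; "localhost" is an assumption.
url = postgres_url("baselinr", "baselinr", "localhost", 5432, "baselinr")
parts = urlsplit(url)
print(url)
print(parts.hostname, parts.port, parts.path)
```

Escaping the credentials matters once a real password contains characters such as `@` or `/`, which would otherwise corrupt the URL.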

📦 Package Distribution

Baselinr can be installed as:

# Basic installation
pip install baselinr

# With Snowflake support (quotes keep shells such as zsh from globbing the brackets)
pip install "baselinr[snowflake]"

# With Dagster orchestration
pip install "baselinr[dagster]"

# Full installation
pip install "baselinr[all]"

# Development mode
pip install -e ".[dev,all]"

🎯 Phase 1 Completion Criteria - STATUS

All criteria from the specification are met:

✅ CLI works: baselinr profile --config config.yml produces results
✅ Dagster integration: Assets discoverable and runnable
✅ Storage: Results written to structured tables
✅ Drift detection: Can compare two profile runs

🔮 Future Enhancements (Post-MVP)

Phase 2

  • Web dashboard for visualization
  • Alert system (email, Slack, PagerDuty)
  • Additional database connectors
  • Enhanced drift detection (ML-based)
  • Data quality rules engine

Phase 3

  • Column correlation analysis
  • PII detection
  • Data lineage tracking
  • Integration with data catalogs
  • Real-time profiling

📄 License

Apache License 2.0 - see LICENSE file for details.

🤝 Contributing

Contributions welcome! See DEVELOPMENT.md for guidelines.


Baselinr v0.1.0 - MVP Complete