Baselinr Project Overview
Complete Project Structure
profile_mesh/
│
├── baselinr/  # Main Python package
│   ├── __init__.py
│   ├── cli.py  # Command-line interface
│   │
│   ├── config/  # Configuration management
│   │   ├── __init__.py
│   │   ├── schema.py  # Pydantic models
│   │   └── loader.py  # YAML/JSON config loader
│   │
│   ├── connectors/  # Database connectors
│   │   ├── __init__.py
│   │   ├── base.py  # Abstract base connector
│   │   ├── postgres.py  # PostgreSQL implementation
│   │   ├── snowflake.py  # Snowflake implementation
│   │   └── sqlite.py  # SQLite implementation
│   │
│   ├── profiling/  # Profiling engine
│   │   ├── __init__.py
│   │   ├── core.py  # Main profiling orchestrator
│   │   └── metrics.py  # Column-level metric calculator
│   │
│   ├── storage/  # Results storage
│   │   ├── __init__.py
│   │   ├── writer.py  # Results writer
│   │   └── schema.sql  # Storage schema DDL
│   │
│   ├── drift/  # Drift detection
│   │   ├── __init__.py
│   │   └── detector.py  # Drift detector and reporter
│   │
│   └── integrations/
│       └── dagster/  # Dagster orchestration
│           ├── __init__.py
│           ├── assets.py  # Asset factory
│           ├── sensors.py  # Plan-aware sensor
│           └── events.py  # Event emission
│
├── examples/  # Example configurations
│   ├── config.yml  # PostgreSQL config
│   ├── config_sqlite.yml  # SQLite config
│   ├── dagster_repository.py  # Dagster definitions
│   └── quickstart.py  # Quickstart script
│
├── docker/  # Docker development environment
│   ├── docker-compose.yml  # Compose configuration
│   ├── Dockerfile  # Application container
│   ├── init_postgres.sql  # Database initialization
│   ├── dagster.yaml  # Dagster instance config
│   └── workspace.yaml  # Dagster workspace config
│
├── tests/  # Test suite
│   ├── __init__.py
│   ├── test_config.py  # Configuration tests
│   └── test_profiling.py  # Profiling tests
│
├── setup.py  # Package setup (setuptools)
├── pyproject.toml  # Modern Python packaging config
├── requirements.txt  # Python dependencies
├── Makefile  # Development automation
├── .gitignore  # Git ignore patterns
├── .dockerignore  # Docker ignore patterns
├── LICENSE  # Apache License 2.0
├── README.md  # Main documentation
├── docs/getting-started/QUICKSTART.md  # Quick start guide
├── DEVELOPMENT.md  # Developer guide
├── PROJECT_OVERVIEW.md  # This file
└── MANIFEST.in  # Package manifest
Key Features Implemented
Phase 1 MVP Complete
All Phase 1 requirements from the specification have been implemented:
1. Profiling Engine
- Profiles tables via SQLAlchemy
- Collects schema metadata
- Computes column metrics:
  - count, null %, distinct %
  - min, max, mean, stddev
  - histograms
  - string length statistics
- Supports sampling
- Outputs structured results (JSON + SQL)
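As a rough illustration of the column metrics listed above, the sketch below computes count, null %, distinct %, min, max, mean and standard deviation for one numeric column in a single SQL query. It uses the standard-library `sqlite3` module instead of SQLAlchemy so that it runs without extra dependencies. The function name `profile_numeric_column`, the result keys and the `orders` sample table are assumptions made for this example and are not taken from Baselinr's code.

```python
import sqlite3


def profile_numeric_column(conn, table, column):
    """Compute basic metrics for one numeric column (illustrative only).

    Identifiers are interpolated into the SQL text, so pass trusted names only.
    """
    row = conn.execute(
        f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column}),
               MIN({column}),
               MAX({column}),
               AVG({column}),
               AVG({column} * {column})
        FROM {table}
        """
    ).fetchone()
    total, nulls, distinct, lowest, highest, mean, mean_sq = row
    nulls = nulls or 0  # SUM returns NULL for an empty table

    stddev = None
    if mean is not None:
        # Population standard deviation: sqrt(E[x^2] - E[x]^2)
        stddev = max(mean_sq - mean * mean, 0.0) ** 0.5

    return {
        "count": total,
        "null_percent": 100.0 * nulls / total if total else 0.0,
        "distinct_percent": 100.0 * distinct / total if total else 0.0,
        "min": lowest,
        "max": highest,
        "mean": mean,
        "stddev": stddev,
    }


# Hypothetical sample data: three values and one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?)", [(10.0,), (20.0,), (20.0,), (None,)]
)
metrics = profile_numeric_column(conn, "orders", "amount")
```

Pushing the aggregation into the database keeps the data transfer to one row per column. Note that here distinct % is measured against all rows, including NULLs. Whether Baselinr uses the same denominator is not stated in this overview.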
2. Configuration System
- YAML/JSON configuration loader
- Pydantic validation
- Warehouse connection configuration
- Table patterns (explicit or wildcard-ready)
- Sampling configuration
- Output destination configuration
- Environment overrides via env vars
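One common way to implement environment overrides is to map prefixed variable names onto nested configuration keys. The sketch below shows that idea using only the standard library. The `BASELINR_` prefix, the double-underscore separator and the `source`/`host`/`port` keys are hypothetical choices for this example. Baselinr's actual naming scheme may differ, so check the configuration documentation before relying on it.

```python
import copy
import json
import os

ENV_PREFIX = "BASELINR_"  # hypothetical prefix, not confirmed by the project


def apply_env_overrides(config, environ=None, prefix=ENV_PREFIX):
    """Return a copy of `config` with PREFIX + SECTION__KEY overrides applied."""
    environ = os.environ if environ is None else environ
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable such as PATH
        path = name[len(prefix):].lower().split("__")
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})
        try:
            # Interpret numbers, booleans and lists written as JSON.
            node[path[-1]] = json.loads(raw)
        except json.JSONDecodeError:
            node[path[-1]] = raw  # plain string value
    return merged


base = {"source": {"type": "postgres", "host": "localhost", "port": 5432}}
overridden = apply_env_overrides(
    base,
    {
        "BASELINR_SOURCE__HOST": "db.internal",
        "BASELINR_SOURCE__PORT": "6543",
        "PATH": "/usr/bin",
    },
)
```

The merged dictionary would then be passed to the Pydantic models for validation, so that overrides are type-checked in the same way as values read from the file.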
3. Storage Layer
- Results table with history
- Schema includes:
  - dataset_name, column_name
  - metric_name, metric_value
  - profiled_at, run_id
- Runs table for metadata
- Automatic table creation
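To make the storage layout concrete, the sketch below creates a results table with the columns named above plus a runs table, then writes two runs and reads back the history of one metric. The table names `profiling_runs` and `profiling_results`, the column types and the `write_run` helper are assumptions for illustration. The authoritative DDL is in `baselinr/storage/schema.sql`.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical DDL; IF NOT EXISTS gives idempotent "automatic table creation".
DDL = """
CREATE TABLE IF NOT EXISTS profiling_runs (
    run_id       TEXT PRIMARY KEY,
    dataset_name TEXT NOT NULL,
    profiled_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS profiling_results (
    run_id       TEXT NOT NULL REFERENCES profiling_runs(run_id),
    dataset_name TEXT NOT NULL,
    column_name  TEXT NOT NULL,
    metric_name  TEXT NOT NULL,
    metric_value TEXT,
    profiled_at  TEXT NOT NULL
);
"""


def write_run(conn, dataset_name, metrics):
    """Store one run; `metrics` maps (column_name, metric_name) to a value."""
    run_id = str(uuid.uuid4())
    profiled_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commit both inserts together, or roll back on error
        conn.execute(
            "INSERT INTO profiling_runs VALUES (?, ?, ?)",
            (run_id, dataset_name, profiled_at),
        )
        conn.executemany(
            "INSERT INTO profiling_results VALUES (?, ?, ?, ?, ?, ?)",
            [
                (run_id, dataset_name, column, metric, str(value), profiled_at)
                for (column, metric), value in metrics.items()
            ],
        )
    return run_id


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
first = write_run(conn, "customers", {("age", "mean"): 41.2})
second = write_run(conn, "customers", {("age", "mean"): 43.9})

# Each run appends rows, so the results table doubles as the metric history.
history = conn.execute(
    "SELECT run_id, metric_value FROM profiling_results "
    "WHERE dataset_name = ? AND column_name = ? AND metric_name = ? "
    "ORDER BY rowid",
    ("customers", "age", "mean"),
).fetchall()
```

Storing one row per metric (a "long" layout) means new metrics can be added without a schema migration, at the cost of storing every value as text.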
4. Execution Layer
- CLI command: baselinr profile --config config.yml
- Dagster integration:
  - Dynamic asset factory
  - Configurable jobs
  - Event emission
  - Schedule definitions
5. Developer Environment
- Docker Compose setup with:
  - PostgreSQL (sample data + results)
  - Dagster daemon
  - Dagster web UI
- Sample data generator (SQL seed script)
- No-cost local setup
- Sample tables: customers, products, orders
6. Drift Detection
- Compare two profile runs
- Detect schema changes
- Calculate metric differences
- Severity classification (low/medium/high)
- JSON output
- Summary statistics
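The drift steps above (compare two runs, detect schema changes, compute metric differences, classify severity, summarize) can be sketched in plain Python as follows. The input structure, the function names and the 5/15/30 percent thresholds are assumptions chosen for this example. They are not Baselinr's documented defaults, and `DriftDetector` may work differently internally.

```python
import json
from collections import Counter


def classify_severity(percent_change, low=5.0, medium=15.0, high=30.0):
    """Map an absolute percent change to a severity label (thresholds assumed)."""
    magnitude = abs(percent_change)
    if magnitude >= high:
        return "high"
    if magnitude >= medium:
        return "medium"
    if magnitude >= low:
        return "low"
    return None  # below the lowest threshold: not reported


def compare_runs(baseline, current):
    """Compare two runs shaped as {"columns": [...], "metrics": {(col, metric): value}}."""
    base_cols, cur_cols = set(baseline["columns"]), set(current["columns"])
    drifts = []
    shared = baseline["metrics"].keys() & current["metrics"].keys()
    for column, metric in sorted(shared):
        old = baseline["metrics"][(column, metric)]
        new = current["metrics"][(column, metric)]
        if old == 0:
            continue  # percent change is undefined; a real detector needs a policy
        change = 100.0 * (new - old) / abs(old)
        severity = classify_severity(change)
        if severity:
            drifts.append({
                "column": column,
                "metric": metric,
                "baseline": old,
                "current": new,
                "percent_change": round(change, 2),
                "severity": severity,
            })
    return {
        "added_columns": sorted(cur_cols - base_cols),
        "removed_columns": sorted(base_cols - cur_cols),
        "metric_drifts": drifts,
        "summary": dict(Counter(d["severity"] for d in drifts)),
    }


baseline = {
    "columns": ["id", "age", "email"],
    "metrics": {("age", "mean"): 40.0, ("age", "null_percent"): 2.0, ("id", "count"): 1000},
}
current = {
    "columns": ["id", "age", "phone"],
    "metrics": {("age", "mean"): 42.0, ("age", "null_percent"): 3.0, ("id", "count"): 1010},
}
report = compare_runs(baseline, current)
report_json = json.dumps(report, indent=2)  # JSON output for files or CI logs
```

In this example the `id` count changes by 1 percent and is not reported, while the `age` mean (5 percent) and null percentage (50 percent) are reported as low and high severity respectively.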
Supported Databases
| Database | Status | Notes |
|---|---|---|
| PostgreSQL | Full | Primary development target |
| SQLite | Full | Lightweight local testing |
| Snowflake | Full | Enterprise data warehouse |
| MySQL | Not yet implemented | Easy to add with a new connector |
| BigQuery | Not yet implemented | Easy to add with a new connector |
| Redshift | Not yet implemented | Easy to add with a new connector |
Available Commands
CLI Commands
# Profile tables
baselinr profile --config config.yml [--output results.json] [--dry-run]
# Detect drift
baselinr drift --config config.yml --dataset <name> \
[--baseline <run-id>] [--current <run-id>] \
[--output report.json] [--fail-on-drift]
Makefile Commands
make help # Show all commands
make install # Install Baselinr
make docker-up # Start Docker environment
make docker-down # Stop Docker environment
make quickstart # Run quickstart example
make test # Run tests
make format # Format code
make lint # Run linters
Python API
from baselinr.config.loader import ConfigLoader
from baselinr.profiling.core import ProfileEngine
from baselinr.storage.writer import ResultWriter
from baselinr.drift.detector import DriftDetector
# Load config
config = ConfigLoader.load_from_file("config.yml")
# Profile tables
engine = ProfileEngine(config)
results = engine.profile()
# Write results
writer = ResultWriter(config.storage)
writer.write_results(results)
# Detect drift
detector = DriftDetector(config.storage)
report = detector.detect_drift(dataset_name="customers")
Getting Started
Choose your path:
1. Quick Test (5 minutes)
cd profile_mesh
make docker-up
pip install -e ".[dagster]"
make quickstart
2. Full Setup (10 minutes)
cd profile_mesh
make install-all
make docker-up
# Wait 30 seconds
baselinr profile --config examples/config.yml
3. Your Database
- Copy examples/config.yml
- Update connection details
- Add your tables
- Run: baselinr profile --config your_config.yml
Documentation Files
| File | Purpose |
|---|---|
| README.md | Main documentation and feature overview |
| docs/getting-started/QUICKSTART.md | Step-by-step getting started guide |
| DEVELOPMENT.md | Architecture and contribution guide |
| PROJECT_OVERVIEW.md | This file - project structure |
Testing
# Run all tests
make test
# Run specific test file
pytest tests/test_config.py -v
# Run with coverage
pytest --cov=baselinr tests/
Docker Environment
The Docker environment includes:
- PostgreSQL (port 5432)
  - Database: baselinr
  - User: baselinr
  - Password: baselinr
  - Sample tables pre-loaded
- Dagster UI (port 3000)
  - http://localhost:3000
  - Pre-configured with Baselinr assets
  - Daily schedule for profiling
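For scripts that connect to the development database directly, the credentials above can be assembled into a connection URL. The sketch below is a minimal example that assumes the container port is published on `localhost`. The helper name `postgres_url` is hypothetical and not part of Baselinr.

```python
from urllib.parse import quote


def postgres_url(user, password, host, port, database):
    """Build a PostgreSQL connection URL, percent-encoding special characters."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


# Values taken from the Docker environment described above; host is assumed.
DEV_URL = postgres_url("baselinr", "baselinr", "localhost", 5432, "baselinr")

# Passwords containing reserved characters must be encoded to stay parseable.
ENCODED_URL = postgres_url("baselinr", "p@ss/word", "localhost", 5432, "baselinr")
```

A URL in this form can be passed to `sqlalchemy.create_engine`, provided a PostgreSQL driver such as psycopg2 is installed. These credentials are intended for local development only.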
Package Distribution
Baselinr can be installed as:
# Basic installation
pip install baselinr
# With Snowflake support (quotes keep shells such as zsh from expanding the brackets)
pip install "baselinr[snowflake]"
# With Dagster orchestration
pip install "baselinr[dagster]"
# Full installation
pip install "baselinr[all]"
# Development mode
pip install -e ".[dev,all]"
Phase 1 Completion Criteria - STATUS
All criteria from the specification are met:
- CLI works: baselinr profile --config config.yml produces results
- Dagster integration: Assets discoverable and runnable
- Storage: Results written to structured tables
- Drift detection: Can compare two profile runs
Future Enhancements (Post-MVP)
Phase 2
- Web dashboard for visualization
- Alert system (email, Slack, PagerDuty)
- Additional database connectors
- Enhanced drift detection (ML-based)
- Data quality rules engine
Phase 3
- Column correlation analysis
- PII detection
- Data lineage tracking
- Integration with data catalogs
- Real-time profiling
License
Apache License 2.0 - see LICENSE file for details.
Contributing
Contributions are welcome! See DEVELOPMENT.md for guidelines.
Baselinr v0.1.0 - MVP Complete