# DataCheck

Data validation engine for data engineers. Define validation rules in YAML, run checks on files, databases, and cloud warehouses from your terminal.

```
pip install datacheck-cli
```

DataCheck provides the `datacheck` CLI and a Python API to validate data, profile quality, and detect schema changes. Run it locally during development, embed it in pipelines (Airflow, Dagster, Prefect), or integrate it into CI/CD workflows.
## Installation

### Requirements

- Python 3.10, 3.11, or 3.12
- pip 21.0 or greater

### Install

```
pip install datacheck-cli
```

### Install with extras
Install only the connectors you need:

```
# Databases
pip install datacheck-cli[postgresql]
pip install datacheck-cli[mysql]
pip install datacheck-cli[mssql]

# Cloud warehouses
pip install datacheck-cli[snowflake]
pip install datacheck-cli[bigquery]
pip install datacheck-cli[redshift]
pip install datacheck-cli[warehouses]   # All three warehouses

# Cloud storage
pip install datacheck-cli[cloud]        # S3, GCS, Azure Blob

# File formats
pip install datacheck-cli[deltalake]
pip install datacheck-cli[avro]
pip install datacheck-cli[duckdb]

# Statistical rules
pip install datacheck-cli[statistical]

# Everything
pip install datacheck-cli[all]
```

### Verify

```
datacheck version
```

## Quickstart
### 1. Generate a config with sample data

```
datacheck config init --with-sample-data
```

This creates a `datacheck.yaml` config file and a sample CSV file. Use `--template` to pick an industry template:

```
datacheck config init --template ecommerce --with-sample-data
```

### 2. Run validation

```
datacheck validate
```

DataCheck auto-discovers config files in this order: `.datacheck.yaml` → `.datacheck.yml` → `datacheck.yaml` → `datacheck.yml`. To specify a config explicitly:

```
datacheck validate --config checks.yaml
```

### 3. Minimal config example
```yaml
# .datacheck.yaml
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

  - name: email_check
    column: email
    rules:
      email_valid: true
```

## Configuration
### Config file structure

A `.datacheck.yaml` file can contain:

```yaml
# Data source (inline, for file-based sources)
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

# Or reference named sources
sources_file: sources.yaml
source: production_db
table: orders

# Validation checks
checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true
    severity: error   # error (default), warning, info
    enabled: true     # default: true

# Custom rule plugins
plugins:
  - ./custom_rules.py

# Config inheritance
extends: base.yaml

# Reporting
reporting:
  output_path: ./reports
  export_failures: true
  failures_file: failures.csv

# Notifications
notifications:
  slack_webhook: "${SLACK_WEBHOOK}"
  mention_on_failure: true

# Sampling
sampling:
  strategy: random
  params:
    sample_rate: 0.1
```

### Checks definition
Each check targets a column and applies one or more rules:

```yaml
checks:
  - name: order_amount        # Rule identifier
    column: amount            # Target column
    rules:
      not_null: true          # Rule type → parameters
      min: 0
      max: 100000
    severity: error           # error (default), warning, info
    enabled: true             # Toggle check on/off

  - name: warehouse_orders
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders             # Override table for this check
    rules:
      min: 0
```

### Severity levels
| Severity | Effect |
|---|---|
| `error` (default) | Causes exit code 1 on failure |
| `warning` | Reported but does not fail the run |
| `info` | Informational only |
Only error-severity failures cause a non-zero exit code.
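This mapping from severities to exit status can be sketched as a small helper (an illustration of the documented behavior, not DataCheck's code):

```python
def exit_code(failures):
    """Map failed checks to a process exit code.

    Only 'error'-severity failures produce a non-zero exit code;
    'warning' and 'info' failures are reported but do not fail the run.
    """
    return 1 if any(f["severity"] == "error" for f in failures) else 0

# Warnings alone keep the exit code at 0
print(exit_code([{"name": "email_format", "severity": "warning"}]))  # 0
# Any error-severity failure flips it to 1
print(exit_code([{"name": "id_check", "severity": "error"}]))        # 1
```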
### Environment variables

Config files support environment variable substitution:

```yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}         # Required — fails if not set
    port: ${DB_PORT:-5432}   # Optional — uses default 5432
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}
```

Use `datacheck config env` to list all variables referenced in a config and their current values:
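The `${VAR}` / `${VAR:-default}` expansion semantics can be sketched with a short regex-based substitute (an illustration of the syntax, not DataCheck's internals):

```python
import os
import re

_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(text):
    """Expand ${VAR} and ${VAR:-default}; raise if a required var is unset."""
    def repl(m):
        name, default = m.group(1), m.group(2)
        value = os.environ.get(name)
        if value is None:
            if default is not None:
                return default
            raise KeyError(f"required environment variable {name!r} is not set")
        return value
    return _PATTERN.sub(repl, text)

os.environ["DB_HOST"] = "db.internal"
print(expand_env("host: ${DB_HOST}, port: ${DB_PORT:-5432}"))
# host: db.internal, port: 5432
```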
```
datacheck config env datacheck.yaml
```

### Config inheritance
Use `extends` to inherit rules from a base config and override or add checks per environment:

```yaml
# base.yaml — shared rules
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true
```

```yaml
# production.yaml — inherits base, adds stricter rules
extends: base.yaml

checks:
  - name: amount_check
    column: amount
    rules:
      min: 0
      max: 50000
    severity: error
```

### Config validation
Check config for errors before running:

```
datacheck config validate
datacheck config validate datacheck.yaml --strict   # Fail on warnings too
```

### Auto-generate config from data

Analyze a data file and generate validation rules automatically:

```
datacheck config generate data.csv
datacheck config generate data.csv --confidence high
datacheck config generate data.csv -o custom.yaml
```

Options:
| Flag | Description |
|---|---|
| `--confidence` / `-c` | Minimum confidence threshold: low, medium (default), high |
| `--output` / `-o` | Output config file path (default: datacheck.yaml) |
| `--name` / `-n` | Dataset name (default: derived from filename) |
| `--force` / `-f` | Overwrite existing config file |
The generated config includes:

- Type inference: Correctly distinguishes `int`, `numeric`, `bool`, `date`, and `string` types
- Regex patterns: Auto-detected patterns for IDs, URLs, dates, etc. using `[0-9]` character classes (not `\d`) for cross-language compatibility
- Statistical rules: `mean_between`, `std_dev_less_than`, `percentile_range` with thresholds derived from data
- Semantic rules: `email_valid`, `phone_valid`, `url_valid`, `json_valid` based on column name detection
- Cross-column rules: `sum_equals` auto-detected when two numeric columns sum to a third
- Temporal rules: `timestamp_range` with 1-day margin, `no_future_timestamps`, `date_format` with detected format string
- Reporting block: Includes `output_path` and `export_failures` settings
- Data source block: Includes file type, path, and `options` (delimiter, encoding, etc.)
### Config validation error reporting

`datacheck config validate` reports all errors at once instead of stopping at the first one. This includes schema errors, missing fields (`name`, `column`, `rules`), and invalid rule definitions:

```
datacheck config validate checks.yaml
# Configuration has errors:
#   - Check #2: Missing required field 'column'
#   - Check #5: Missing required field 'rules'
#   - Schema validation failed at 'checks.3.rules.min': -1 is not valid
```

### Show resolved config
Display the fully resolved configuration with env vars and inheritance applied:

```
datacheck config show
datacheck config show datacheck.yaml --format json
datacheck config show --no-resolve-env
datacheck config show --no-resolve-extends
```

### Merge configs
Merge multiple configuration files. Later files override values from earlier files:

```
datacheck config merge base.yaml production.yaml
datacheck config merge base.yaml prod.yaml -o merged.yaml
```

### List templates

Show all available templates with descriptions:

```
datacheck config templates
```

## Data Sources
### File sources (inline in config)

#### CSV

```yaml
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8
```

#### Parquet

```yaml
data_source:
  type: parquet
  path: ./data/orders.parquet
```

#### Avro (requires `pip install datacheck-cli[avro]`)

```yaml
data_source:
  type: avro
  path: ./data/orders.avro
```

#### Delta Lake (requires `pip install datacheck-cli[deltalake]`)

```yaml
data_source:
  type: delta
  path: ./data/delta-table
```

Delta Lake supports time travel:

```
datacheck validate --delta-version 5
datacheck validate --delta-timestamp "2026-01-15T10:00:00"
datacheck validate --storage-options '{"AWS_ACCESS_KEY_ID": "..."}'
```

#### SQLite

```yaml
data_source:
  type: sqlite
  path: ./data/analytics.db
```

#### DuckDB (requires `pip install datacheck-cli[duckdb]`)

```yaml
data_source:
  type: duckdb
  path: ./data/analytics.duckdb
```

### Database sources (named sources)
For databases, define named sources in a `sources.yaml` file:

```yaml
# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}
    schema: public

  mysql_db:
    type: mysql
    host: ${MYSQL_HOST}
    port: ${MYSQL_PORT:-3306}
    database: ${MYSQL_DB}
    user: ${MYSQL_USER}
    password: ${MYSQL_PASSWORD}

  mssql_db:
    type: mssql
    host: ${MSSQL_HOST}
    port: ${MSSQL_PORT:-1433}
    database: ${MSSQL_DB}
    user: ${MSSQL_USER}
    password: ${MSSQL_PASSWORD}
```

### Cloud warehouse sources
```yaml
# sources.yaml
sources:
  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}
    role: ${SF_ROLE}
    # SSO: authenticator: externalbrowser
    # Key pair: private_key_path: /path/to/key.p8

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public
    # IAM auth: cluster_identifier, region, iam_auth: true
```

Snowflake, BigQuery, and Redshift support server-side filtering and sampling — WHERE clauses, LIMIT, and TABLESAMPLE execute on the warehouse to minimize data transfer before validation runs locally.
### Cloud storage sources

```yaml
# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

  gcs_data:
    type: gcs
    bucket: my-bucket
    path: data/orders.csv
    credentials_path: /path/to/service-account.json

  azure_data:
    type: azure
    container: my-container
    path: data/orders.csv
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    # Or: account_name + account_key
```

### Connection strings

You can also pass connection strings directly to the CLI:

```
datacheck validate postgresql://user:pass@host:5432/db --table orders
datacheck validate mysql://user:pass@host:3306/db --table orders
datacheck validate mssql://user:pass@host:1433/database --table orders
datacheck validate snowflake://account/database/schema --table orders
datacheck validate bigquery://project/dataset --table orders
datacheck validate redshift://user:pass@host:5439/database/schema --table orders
```

### Named sources and per-check overrides
Reference a named source in your config:

```yaml
# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh   # Override source for this check
    table: orders
    rules:
      min: 0
```

Switch sources at runtime:

```
datacheck validate --source snowflake_wh --config checks.yaml
datacheck validate --source s3_data --sources-file sources.yaml
```

### Connection pre-validation
When validating against database sources, DataCheck tests connectivity for all referenced sources before running any validation rules. If multiple sources are unreachable, all connection errors are reported together:

```
Source connectivity check failed:
  - Source 'production_db' (postgresql): Connection failed — could not connect to server
  - Source 'analytics_wh' (snowflake): Connection failed — invalid credentials
```

For file-based sources, DataCheck verifies the file exists before validation begins.

### SQL filtering

Use `--table`, `--where`, and `--query` for server-side filtering:

```
datacheck validate --source production_db --table orders --where "status = 'active'"
datacheck validate --source production_db --query "SELECT * FROM orders WHERE created_at > '2026-01-01'"
```

## Validation Rules
### Null and uniqueness

| Rule | YAML Syntax | Description |
|---|---|---|
| `not_null` | `not_null: true` | No null or missing values |
| `unique` | `unique: true` | No duplicate values (nulls ignored) |
| `unique_combination` | `unique_combination: [col1, col2]` | Composite uniqueness across columns |
### Numeric

| Rule | YAML Syntax | Description |
|---|---|---|
| `min` | `min: 0` | Column >= value |
| `max` | `max: 10000` | Column <= value |
| `mean_between` | `mean_between: {min: 10, max: 50}` | Column mean within range |
| `std_dev_less_than` | `std_dev_less_than: 5.0` | Standard deviation below threshold |
| `percentile_range` | `percentile_range: {p25_min: 10, p25_max: 20, p75_min: 80, p75_max: 90}` | 25th and 75th percentile bounds |
| `z_score_outliers` | `z_score_outliers: 3.0` | Detect outliers by z-score (default threshold: 3.0) |
| `distribution_type` | `distribution_type: 'normal'` | Validate distribution shape — normal or uniform (uses KS test) |
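The z-score rule flags values whose distance from the column mean exceeds the threshold in standard deviations. A minimal sketch of that computation (illustrative, not DataCheck's vectorized implementation):

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # constant column: no outliers
    return [v for v in values if abs((v - mu) / sigma) > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 500]  # 500 is an obvious outlier
print(z_score_outliers(data, threshold=2.0))  # [500]
```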
### String and pattern

| Rule | YAML Syntax | Description |
|---|---|---|
| `regex` | `regex: '^[A-Z]{2}[0-9]{4}$'` | Match regex pattern |
| `allowed_values` | `allowed_values: [active, inactive, pending]` | Value in allowed set |
| `type` | `type: 'string'` | Data type check (int, numeric, string, bool, date, datetime) |
| `length` | `length: {min: 1, max: 100}` | String length constraints |
| `min_length` | `min_length: 1` | Minimum string length |
| `max_length` | `max_length: 255` | Maximum string length |
### Temporal

| Rule | YAML Syntax | Description |
|---|---|---|
| `max_age` | `max_age: '24h'` | Data freshness — supports h (hours), d (days), w (weeks), m (minutes) |
| `timestamp_range` | `timestamp_range: {min: "2025-01-01", max: "2026-12-31"}` | Timestamps within range (ISO format) |
| `date_range` | `date_range: {min: "2025-01-01", max: "2026-12-31"}` | Alias for timestamp_range |
| `no_future_timestamps` | `no_future_timestamps: true` | No timestamps beyond current time |
| `date_format_valid` | `date_format_valid: '%Y-%m-%d'` | Validates date format (Python strftime) |
| `date_format` | `date_format: {format: '%Y-%m-%d'}` | Alias for date_format_valid (dict form) |
| `business_days_only` | `business_days_only: 'US'` | Weekdays only — pass country code (e.g., 'US', 'GB') or true for default |
### Semantic and format

| Rule | YAML Syntax | Description |
|---|---|---|
| `email_valid` | `email_valid: true` | RFC 5322 email format (two-stage: regex pre-filter + email-validator library) |
| `phone_valid` | `phone_valid: 'US'` | Phone number format (phonenumbers library, supports all countries; pass country code or true) |
| `url_valid` | `url_valid: true` | URL structure validation |
| `json_valid` | `json_valid: true` | Valid JSON parsing |
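The two-stage pattern noted for `email_valid` (a cheap regex pre-filter, then a stricter parser) can be sketched like this; the regex and the stdlib second stage below are illustrative stand-ins, not DataCheck's actual checks:

```python
import re
from email.utils import parseaddr

# Stage 1: cheap structural pre-filter, rejects obvious non-emails fast
_PREFILTER = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_valid(value):
    if not _PREFILTER.match(value):
        return False
    # Stage 2: stricter parse (DataCheck uses the email-validator library here;
    # stdlib parseaddr stands in for illustration)
    _, addr = parseaddr(value)
    return addr == value

print(email_valid("user@example.com"))   # True
print(email_valid("not-an-email"))       # False
```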
### Cross-column and relationships

| Rule | YAML Syntax | Description |
|---|---|---|
| `unique_combination` | `unique_combination: [col1, col2]` | Composite uniqueness across multiple columns |
| `foreign_key_exists` | Python API | Foreign key validation against a reference DataFrame (use Python API to pass live data) |
| `sum_equals` | `sum_equals: {column_a: col1, column_b: col2}` | Verify column equals sum of two other columns (with optional tolerance) |
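A row-wise sketch of the `sum_equals` check with tolerance (illustrative; the parameter names mirror the YAML above):

```python
def sum_equals(target, column_a, column_b, tolerance=1e-9):
    """Return indices of rows where target != column_a + column_b (within tolerance)."""
    return [
        i for i, (t, a, b) in enumerate(zip(target, column_a, column_b))
        if abs(t - (a + b)) > tolerance
    ]

total    = [10.0, 5.5, 7.0]
subtotal = [8.0, 5.0, 3.0]
tax      = [2.0, 0.5, 3.5]
print(sum_equals(total, subtotal, tax))  # [2]  (row 2: 3.0 + 3.5 != 7.0)
```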
### Example: complete config with rules

```yaml
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_not_null
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_range
    column: amount
    rules:
      not_null: true
      min: 0
      max: 100000
      z_score_outliers:
        threshold: 3.0
    severity: error

  - name: email_format
    column: email
    rules:
      email_valid: true
    severity: warning

  - name: order_date
    column: created_at
    rules:
      no_future_timestamps: true
      max_age: '30d'
      date_format_valid: '%Y-%m-%d %H:%M:%S'

  - name: status_values
    column: status
    rules:
      allowed_values:
        - pending
        - confirmed
        - shipped
        - delivered
        - cancelled
```

## Custom Rules
### Creating custom rules

Create a Python file with functions decorated with `@custom_rule`. Each function receives a `pd.Series` and optional parameters, and returns a boolean `pd.Series` where `True` means valid:

```python
# custom_rules.py
from datacheck.plugins.decorators import custom_rule
import pandas as pd


@custom_rule
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    """Validate that emails use approved business domains."""
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)


@custom_rule
def is_positive_margin(column: pd.Series, min_margin: float = 0.0) -> pd.Series:
    """Validate profit margin is above threshold."""
    return column.dropna() >= min_margin
```

### Referencing plugins in config
```yaml
plugins:
  - ./custom_rules.py

checks:
  - name: email_domain
    column: email
    rules:
      custom:
        rule: is_business_email
        params:
          allowed_domains: ["company.com", "corp.com"]

  - name: margin_check
    column: profit_margin
    rules:
      custom:
        rule: is_positive_margin
        params:
          min_margin: 0.05
```

### Plugin registry
- `load_from_file()` imports the Python module and registers all `@custom_rule`-decorated functions
- Registered rules become available through the `RuleFactory` alongside built-in rules
- The global registry tracks all loaded custom rules
## Data Profiling

### Running profiling

```
# Direct file path
datacheck profile data.csv

# Auto-discover config
datacheck profile

# Explicit config file
datacheck profile --config checks.yaml

# Named source
datacheck profile --source production_db --sources-file sources.yaml

# Named source with table
datacheck profile --source production_db --table orders
```

### Profile options
| Flag | Description |
|---|---|
| `--format` / `-f` | Output format: terminal (default), json, markdown |
| `--output` / `-o` | Write output to file |
| `--outlier-method` | Outlier detection method: zscore (default) or iqr |
| `--suggestions` / `--no-suggestions` | Show rule suggestions (default: enabled) |
| `--correlations` / `--no-correlations` | Show correlation matrix |

```
datacheck profile data.csv --format json -o profile.json
datacheck profile --outlier-method iqr --correlations
datacheck profile --format markdown -o report.md
```

### What profiling computes
- Basic counts: total rows, null count, unique count, duplicate count, completeness percentage
- Numeric statistics: min, max, mean, median, standard deviation, 25th/50th/75th percentiles
- Value distributions: top N values with counts
- Outlier detection: Z-score method (|z| > 3.0) or IQR method (values outside Q1-1.5*IQR to Q3+1.5*IQR)
- Correlation matrix: Pearson correlation between all numeric columns
- Quality scoring: 0-100 score per column and per dataset
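The IQR method listed above flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A minimal sketch (illustrative; DataCheck's percentile interpolation may differ):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values outside the 1.5*IQR fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(data))  # [95]
```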
### Quality scoring

Each column receives a 0-100 quality score based on:
| Factor | What it measures |
|---|---|
| Completeness | Penalizes null/missing values |
| Uniqueness | Penalizes duplicate values |
| Validity | Type consistency across the column |
| Consistency | Low variance in categorical columns |
The dataset score is a weighted average of all column scores.
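A toy sketch of a score built from two of these factors (the weights and factor mix here are invented for illustration; DataCheck's actual scoring uses all four factors):

```python
def column_score(values):
    """Score a column 0-100 from completeness and uniqueness (toy weights)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / total
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    return round(100 * (0.6 * completeness + 0.4 * uniqueness), 1)

print(column_score([1, 2, 3, None]))  # 85.0 (3 of 4 present, all unique)
```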
### Rule suggestions

The profiler automatically suggests validation rules based on data patterns:

- Numeric columns: range rules, outlier thresholds, distribution checks, type (`int` vs `numeric`)
- String columns: length constraints, regex patterns, allowed value sets
- Temporal columns: date format detection, timestamp ranges (with margin), `no_future_timestamps`
- Semantic columns: `email_valid`, `phone_valid`, `url_valid`, `json_valid` inferred from column names and content
- Cross-column: `sum_equals` auto-detected when two numeric columns sum to a third
- All columns: null checks, uniqueness rules
## Schema Detection and Evolution

### Commands

```
datacheck schema capture    # Save current schema as baseline
datacheck schema compare    # Compare current data against baseline
datacheck schema show       # Display detected schema
datacheck schema list       # List all saved baselines
datacheck schema history    # View capture history
```

### Schema capture
```
datacheck schema capture data.csv
datacheck schema capture --source production_db --sources-file sources.yaml
datacheck schema capture --name v2-baseline
datacheck schema capture --baseline-dir ./schemas
datacheck schema capture --no-history
```

| Flag | Description |
|---|---|
| `--name` / `-n` | Baseline name (default: baseline) |
| `--baseline-dir` | Storage directory (default: .datacheck/schemas/) |
| `--save-history` / `--no-history` | Save to history (default: enabled) |
### Schema compare

```
datacheck schema compare data.csv
datacheck schema compare --baseline v2-baseline
datacheck schema compare --fail-on-breaking
datacheck schema compare --rename-threshold 0.9
datacheck schema compare --format json
```

| Flag | Description |
|---|---|
| `--baseline` / `-b` | Baseline name to compare against (default: baseline) |
| `--rename-threshold` | Similarity threshold for rename detection (0.0-1.0, default: 0.8) |
| `--fail-on-breaking` | Exit with code 1 on breaking changes |
| `--format` / `-f` | Output format: terminal (default) or json |
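Rename detection compares a removed column name against each added one and treats a pair above the threshold as a rename. A sketch using stdlib `difflib` (the similarity metric DataCheck uses is not specified here, so `SequenceMatcher` is an assumption):

```python
from difflib import SequenceMatcher

def looks_like_rename(old_name, new_name, threshold=0.8):
    """True if the two column names are similar enough to count as a rename."""
    ratio = SequenceMatcher(None, old_name, new_name).ratio()
    return ratio >= threshold

print(looks_like_rename("customer_id", "customer_ids"))  # True
print(looks_like_rename("customer_id", "order_total"))   # False
```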
### Schema compare exit codes
| Code | Meaning |
|---|---|
| 0 | Compatible — no breaking changes |
| 1 | Breaking changes detected (with --fail-on-breaking) |
| 2 | Baseline not found |
| 3 | Data load error |
| 4 | Unexpected error |
### What schema tracks
For each column: name, data type, nullable status, position, unique value count, null percentage. For the dataset: row count, source identifier, capture timestamp.
### Change types detected
| Change | Compatibility Level |
|---|---|
| Column added | COMPATIBLE |
| Column removed | BREAKING |
| Column renamed | WARNING |
| Nullable changed | WARNING |
| Order changed | COMPATIBLE |
### Type change compatibility
Compatible changes (widening): int→float, int→string, float→string, bool→string, date→datetime, date→string, datetime→string
Breaking changes (narrowing): float→int, string→int, string→float, string→bool, datetime→date, string→datetime, string→date
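The widening rules above can be expressed as a lookup of allowed type promotions (a sketch mirroring the lists in this section):

```python
# Allowed widening conversions, per the compatibility lists above
WIDENING = {
    ("int", "float"), ("int", "string"), ("float", "string"),
    ("bool", "string"), ("date", "datetime"), ("date", "string"),
    ("datetime", "string"),
}

def classify_type_change(old, new):
    if old == new:
        return "UNCHANGED"
    return "COMPATIBLE" if (old, new) in WIDENING else "BREAKING"

print(classify_type_change("int", "float"))   # COMPATIBLE
print(classify_type_change("float", "int"))   # BREAKING
```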
### Baseline storage

- Baselines are stored as JSON files in `.datacheck/schemas/`
- History entries are stored in `.datacheck/schemas/history/` with timestamps (e.g. `schema_20260212_143000.json`)
- Use `datacheck schema list` to see all baselines
- Use `datacheck schema history --limit 20` to see recent history
## Sampling Strategies

### Available strategies

| Strategy | Description | Key Parameters |
|---|---|---|
| `random` | Simple random sampling | `sample_rate` or `sample_count`, `seed` |
| `stratified` | Preserve value distributions across groups | `stratify_column`, `min_per_stratum` |
| `time_based` | Sample within a time window | `time_column`, `start_date`, `end_date` |
| `error_focused` | Prioritize rows matching error conditions | `error_conditions` (e.g. `['age<0', 'price>10000']`) |
| `adaptive` | Adjust sample size based on data characteristics | `target_quality`, `initial_size` |
| `reservoir` | Single-pass sampling for streaming data | `sample_count` |
| `systematic` | Every Nth row | `sample_rate` |
| `top_n` | First N rows | `--top N` |
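Reservoir sampling keeps a fixed-size, uniformly random sample from a stream without knowing its length in advance. A classic Algorithm R sketch (illustrative, not DataCheck's implementation):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k/(i+1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(len(sample))  # 5
```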
### CLI sampling flags

```
# Random sampling
datacheck validate --sample-rate 0.1               # 10% of rows
datacheck validate --sample-count 1000             # Exactly 1000 rows
datacheck validate --sample-count 1000 --seed 42   # Reproducible

# First N rows
datacheck validate --top 500

# Strategy-based
datacheck validate --sample-strategy stratified --stratify region
datacheck validate --sample-strategy time_based --time-column created_at --start-date 2026-01-01 --end-date 2026-02-01
datacheck validate --sample-strategy error_focused --error-indicators "age<0,price>10000"
```

| Flag | Description |
|---|---|
| `--sample-rate` | Fraction to sample (0.0-1.0) |
| `--sample-count` | Exact number of rows to sample |
| `--top` | First N rows only |
| `--sample-strategy` | Strategy name: random, stratified, time_based, error_focused, adaptive, reservoir |
| `--stratify` | Column for stratified sampling |
| `--seed` | Random seed for reproducibility |
| `--time-column` | Column for time-based sampling |
| `--start-date` | Start date (ISO format) |
| `--end-date` | End date (ISO format) |
| `--error-indicators` | Comma-separated error conditions |
## CLI Command Reference

### datacheck validate

Run validation against data files or databases.

Data source flags:

| Flag | Description |
|---|---|
| `data_source` (positional) | File path or connection string |
| `--config` / `-c` | Path to validation config YAML |
| `--source` | Named source from sources.yaml |
| `--sources-file` | Path to sources YAML file |
| `--table` / `-t` | Database table name |
| `--where` / `-w` | SQL WHERE clause for filtering |
| `--query` / `-q` | Custom SQL query |
| `--schema` / `-s` | Schema/dataset name |
Warehouse-specific flags:

| Flag | Description |
|---|---|
| `--warehouse` | Snowflake warehouse name |
| `--credentials` | Path to credentials file (BigQuery service account) |
| `--region` | AWS region (Redshift IAM auth) |
| `--cluster` | Cluster identifier (Redshift IAM auth) |
| `--iam-auth` | Use IAM authentication (Redshift) |

Delta Lake flags:

| Flag | Description |
|---|---|
| `--delta-version` | Delta Lake version to load (time travel) |
| `--delta-timestamp` | Timestamp to load data as of (ISO 8601) |
| `--storage-options` | JSON string of storage options for cloud access |
Sampling flags: See Sampling Strategies.

Execution flags:

| Flag | Description |
|---|---|
| `--parallel` | Enable multi-core parallel execution |
| `--workers` | Number of worker processes (default: CPU count) |
| `--chunk-size` | Rows per chunk for parallel processing (default: 10,000) |
| `--progress` / `--no-progress` | Show/hide progress bar |

Output flags:

| Flag | Description |
|---|---|
| `--output` / `-o` | Save results to a JSON file |
| `--csv-export` | Export failure details as CSV |
| `--suggestions` / `--no-suggestions` | Show improvement suggestions (default: enabled) |
| `--slack-webhook` | Slack webhook URL for notifications |

Logging flags:

| Flag | Description |
|---|---|
| `--log-level` | Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--log-format` | Log format: console (human-readable) or json (machine-parseable) |
| `--log-file` | Path to log file (with automatic rotation) |
| `--verbose` / `-v` | Shortcut for --log-level DEBUG |
### datacheck profile

Generate data quality profiles with statistics, quality scores, and rule suggestions.

Same data source flags as `validate`, plus:

| Flag | Description |
|---|---|
| `--format` / `-f` | Output format: terminal (default), json, markdown |
| `--output` / `-o` | Write output to file |
| `--outlier-method` | Detection method: zscore (default) or iqr |
| `--suggestions` / `--no-suggestions` | Show rule suggestions |
| `--correlations` / `--no-correlations` | Show correlation matrix |
### datacheck config

Configuration management commands.

| Subcommand | Description |
|---|---|
| `config init` | Generate config from template |
| `config init --template <name>` | Use specific template (basic, ecommerce, healthcare, finance, saas, iot, rules-reference, sources) |
| `config init --with-sample-data` | Also generate a sample CSV file |
| `config init --sample-rows N` | Number of sample rows to generate (default: 100) |
| `config init --force` | Overwrite existing config file |
| `config validate <file>` | Validate config file syntax and rule definitions |
| `config validate --strict` | Fail on warnings too |
| `config show <file>` | Show fully resolved config (env vars + inheritance applied) |
| `config show --format yaml/json` | Output format |
| `config show --no-resolve-env` | Skip environment variable resolution |
| `config show --no-resolve-extends` | Skip config inheritance resolution |
| `config merge <files...>` | Merge multiple configs (later files override earlier) |
| `config merge -o output.yaml` | Write merged result to file |
| `config generate <file>` | Auto-generate rules from data analysis |
| `config generate --confidence` | Minimum confidence: low, medium (default), high |
| `config templates` | List available templates with descriptions |
| `config env <file>` | Show environment variables referenced in config |
### datacheck schema

Schema evolution detection and management.

| Subcommand | Description |
|---|---|
| `schema capture` | Save current schema as baseline |
| `schema compare` | Compare current data against baseline |
| `schema show` | Display detected schema (columns, types, nullable, stats) |
| `schema list` | List all saved baseline schemas |
| `schema history` | View capture history (newest first) |

### datacheck version

Display version information.
### Exit codes
| Code | Meaning |
|---|---|
| 0 | All rules passed (or only warning/info severity failures) |
| 1 | Some error-severity rules failed |
| 2 | Configuration error |
| 3 | Data loading error |
| 4 | Unexpected error |
## Output and Reporting

### Terminal output
DataCheck uses Rich-formatted terminal output with color-coded results:
- Green: Passed rules
- Red: Failed rules
- Yellow: Errors during rule execution
Output includes a statistics table (records, columns, rules, pass/fail counts), detailed failure tables (check name, column, failure count, sample values), and actionable improvement suggestions.
### JSON export

```
datacheck validate --output results.json
```

Exports full validation results in machine-readable JSON format, including all rule results, failure details, and summary statistics. Use this for automation and CI/CD integration.

### CSV export

```
datacheck validate --csv-export failures.csv
```

Exports failure details as CSV with columns: check_name, column, severity, failed_rows, reason, suggestion.

### Markdown reports

```
datacheck profile --format markdown -o report.md
```

Generates markdown-formatted profile reports with tables, statistics, and quality scores.
### Slack notifications

Configure the webhook in your config file so you don't need to pass it every time:

```yaml
notifications:
  slack_webhook: "${SLACK_WEBHOOK}"
  mention_on_failure: true   # @channel on failures (default: false)
```

Or pass it via the CLI (overrides the config value):

```
datacheck validate --slack-webhook https://hooks.slack.com/services/...
```

Sends validation results to Slack with:

- Color-coded messages (green for pass, red for fail)
- Summary statistics and failed rules
- Optional `@channel` mention on failures (via `mention_on_failure`)
- Up to 5 failed rule details with row counts
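A sketch of the payload such a notification might carry (field names and structure here are illustrative, not DataCheck's exact message format):

```python
def build_slack_payload(passed, failed_rules, mention_on_failure=False):
    """Build an incoming-webhook payload summarizing a validation run."""
    color = "good" if passed else "danger"
    text = "Validation passed" if passed else f"Validation failed: {len(failed_rules)} rule(s)"
    if not passed and mention_on_failure:
        text = "<!channel> " + text
    return {
        "attachments": [{
            "color": color,
            "text": text,
            # Truncation mirrors "up to 5 failed rule details"
            "fields": [{"title": r, "short": True} for r in failed_rules[:5]],
        }]
    }

payload = build_slack_payload(False, ["id_check", "amount_range"], mention_on_failure=True)
print(payload["attachments"][0]["text"])  # <!channel> Validation failed: 2 rule(s)
```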
## Parallel Execution and Performance

### Enabling parallel mode

```
datacheck validate --parallel
datacheck validate --parallel --workers 4
datacheck validate --parallel --chunk-size 50000
datacheck validate --parallel --progress
```

| Flag | Description |
|---|---|
| `--parallel` | Enable multi-core parallel execution |
| `--workers` | Number of worker processes (default: CPU count) |
| `--chunk-size` | Rows per chunk (default: 10,000) |
| `--progress` / `--no-progress` | Show/hide progress bar |

### How parallel execution works

- Splits the DataFrame into chunks based on `--chunk-size`
- Processes chunks in parallel using `multiprocessing.Pool`
- Aggregates results across chunks (combines pass/fail counts, merges failure details)
- Automatically falls back to sequential execution for small datasets
- Shows a Rich progress bar with spinner, elapsed time, and remaining time
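The chunk-and-aggregate flow can be sketched without the multiprocessing machinery (map over chunks, then merge counts; in DataCheck the map step runs in a `multiprocessing.Pool`):

```python
def chunked(rows, chunk_size):
    """Yield successive fixed-size slices of the input."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def check_chunk(chunk):
    """Per-chunk result: pass/fail counts for a not_null-style rule."""
    failed = sum(1 for v in chunk if v is None)
    return {"passed": len(chunk) - failed, "failed": failed}

def aggregate(results):
    """Merge per-chunk counts into run totals."""
    return {
        "passed": sum(r["passed"] for r in results),
        "failed": sum(r["failed"] for r in results),
    }

rows = [1, None, 3, 4, None, 6, 7]
results = [check_chunk(c) for c in chunked(rows, chunk_size=3)]
print(aggregate(results))  # {'passed': 5, 'failed': 2}
```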
### Performance features

- PyArrow backend: Vectorized operations for faster validation (e.g. fast null count via Arrow)
- Lazy loading: Cloud connectors are loaded only when needed — no unnecessary dependencies
- Memory optimization: Memory-aware chunk sizing, worker auto-scaling, and large file handling
- Caching: Regex compilation caching (`@lru_cache`) and compute-once patterns for expensive operations
- Vectorized rules: NumPy/Pandas vectorized operations — no Python loops in hot paths
## Logging

### Log configuration

```
datacheck validate --verbose                    # DEBUG level
datacheck validate --log-level WARNING          # Specific level
datacheck validate --log-format json            # Machine-parseable JSON logs
datacheck validate --log-file validation.log    # Log to file (with rotation)
datacheck validate --log-level DEBUG --log-format json --log-file debug.log
```

| Flag | Description |
|---|---|
| `--log-level` | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--log-format` | console (human-readable, default) or json (machine-parseable) |
| `--log-file` | Path to log file (automatic rotation) |
| `--verbose` / `-v` | Shortcut for --log-level DEBUG |
### Logging features
- Structured logging: Console and JSON formatters for different use cases
- Sensitive data masking: Automatically masks credentials and passwords in log output
- Trace IDs: Unique trace ID per validation run for log correlation across systems
- File rotation: Automatic log file rotation to prevent unbounded growth
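Sensitive-data masking can be sketched as a regex pass over log messages before they are emitted (the pattern below is an illustrative stand-in for DataCheck's masking rules):

```python
import re

# Mask values of password/secret/token-like keys in "key=value" log text
_SENSITIVE = re.compile(r"(?i)\b(password|secret|token|access_key)=(\S+)")

def mask(message):
    """Replace sensitive values with asterisks before logging."""
    return _SENSITIVE.sub(r"\1=****", message)

print(mask("connecting host=db.internal password=hunter2 user=app"))
# connecting host=db.internal password=**** user=app
```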
## Security

### Credential handling

- Environment variables: Use `${VAR}` and `${VAR:-default}` syntax in config files — never hardcode credentials
- Credential files: Load credentials from external files
- Password masking: Credentials are automatically masked in logs and terminal output
- Config env audit: Use `datacheck config env` to verify all required variables are set
### Connection security
- Connection string validation before attempting connections
- SQL injection prevention: table name validation, WHERE clause scanning, parameterized queries
- Path traversal prevention with null byte and symlink detection
- SSL/TLS enforcement for warehouse connections
## Airflow Integration

DataCheck provides two Airflow operators for use in DAGs, plus a simpler BashOperator pattern.

### DataCheckOperator

Run data validation inside Airflow DAGs:

```python
from datacheck.airflow.operators import DataCheckOperator

validate_orders = DataCheckOperator(
    task_id="validate_orders",
    config_path="/path/to/datacheck.yaml",
    file_path="/data/orders.csv",
    fail_on_error=True,
    push_results=True,
    min_pass_rate=95.0,
)
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str | required | Path to validation config YAML |
| file_path | str | None | Path to data file (CSV, Parquet, Avro, Delta, etc.) |
| sources_file | str | None | Path to sources YAML (overrides config) |
| source_name | str | None | Named source from sources.yaml |
| table | str | None | Database table name |
| where | str | None | SQL WHERE clause |
| query | str | None | Custom SQL query |
| sample_rate | float | None | Random sample fraction (0.0-1.0) |
| parallel | bool | False | Enable multi-core validation |
| workers | int | None | Number of worker processes |
| min_pass_rate | float | 0 | Minimum rule pass rate (0-100, 0=disabled) |
| min_quality_score | float | 0 | Minimum quality score (0-100, 0=disabled) |
| fail_on_error | bool | True | Fail Airflow task on validation failure |
| push_results | bool | True | Push results to XCom |
Template fields: config_path, file_path, sources_file, source_name, table, where, query (supports .yaml and .yml extensions)
XCom output:
- validation_results: Full results dictionary
- passed: Boolean pass/fail result
- pass_rate: Percentage of rules passed
Data source resolution order:
1. file_path — file-based validation
2. source_name + sources_file — named source validation
3. Config default (source or data_source from config)
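The precedence above amounts to a first-match-wins chain. As an illustrative sketch of that logic (not the operator's internals):

```python
# First-match-wins resolution, mirroring the documented precedence:
# explicit file, then named source, then the config's own default.
def resolve_source(file_path=None, source_name=None, sources_file=None,
                   config_default=None):
    if file_path is not None:
        return ("file", file_path)
    if source_name is not None:
        return ("named_source", source_name)
    if config_default is not None:
        return ("config_default", config_default)
    raise ValueError("no data source configured")

# file_path takes precedence even when a named source is also given:
print(resolve_source(file_path="/data/orders.csv", source_name="orders_db"))
# ('file', '/data/orders.csv')
```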
DataCheckSchemaOperator
Detect schema changes inside Airflow DAGs:
from datacheck.airflow.operators import DataCheckSchemaOperator
check_schema = DataCheckSchemaOperator(
    task_id="check_schema",
    config_path="/path/to/datacheck.yaml",
    file_path="/data/orders.csv",
    baseline_name="orders-v2",
    fail_on_breaking=True,
    push_results=True,
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str | required | Path to validation config YAML |
| file_path | str | None | Path to data file |
| sources_file | str | None | Path to sources YAML |
| source_name | str | None | Named source from sources.yaml |
| table | str | None | Database table name |
| baseline_name | str | "baseline" | Baseline identifier |
| baseline_dir | str | ".datacheck/schemas" | Baseline storage directory |
| fail_on_breaking | bool | True | Fail Airflow task on breaking schema changes |
| push_results | bool | True | Push results to XCom |
XCom output:
- schema_results: Schema comparison results dictionary
- schema_compatible: Boolean compatibility flag
Auto-captures a new baseline if none exists yet.
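The auto-capture behavior boils down to: store the schema on the first run, compare on later runs. A sketch of that flow, illustrative only (the real baseline format and compatibility rules are DataCheck internals); here dropped or retyped columns count as breaking and added columns do not:

```python
import json
import tempfile
from pathlib import Path

def check_schema(current: dict, baseline_dir: str, baseline_name: str = "baseline") -> dict:
    """Compare a {column: dtype} mapping against a stored baseline."""
    path = Path(baseline_dir) / f"{baseline_name}.json"
    if not path.exists():
        # No baseline yet: capture the current schema and report compatible.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(current))
        return {"captured": True, "compatible": True}
    baseline = json.loads(path.read_text())
    removed = set(baseline) - set(current)                    # dropped columns
    retyped = {c for c in set(baseline) & set(current)
               if baseline[c] != current[c]}                  # type changes
    return {"captured": False, "compatible": not (removed or retyped)}

demo_dir = tempfile.mkdtemp()
first = check_schema({"id": "int", "amount": "float"}, demo_dir)
print(first)   # {'captured': True, 'compatible': True}
second = check_schema({"id": "int"}, demo_dir)  # amount was dropped
print(second)  # {'captured': False, 'compatible': False}
```

Whether added columns count as breaking in DataCheck itself should be confirmed against its actual schema output.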
BashOperator pattern
For simpler integration, use Airflow's BashOperator directly:
from airflow.operators.bash import BashOperator
validate = BashOperator(
    task_id="validate_data",
    bash_command="datacheck validate --config /path/to/config.yaml --output /tmp/results.json",
)
Exit codes work directly with Airflow task status — exit code 0 means success, any non-zero code fails the task.
CI/CD Integration
DataCheck uses standard exit codes for automation. Any non-zero exit code fails the pipeline.
| Code | Meaning | CI/CD Effect |
|---|---|---|
| 0 | All rules passed | Pipeline continues |
| 1 | Error-severity failures | Pipeline fails (blocks deploy) |
| 2 | Configuration error | Pipeline fails |
| 3 | Data loading error | Pipeline fails |
| 4 | Unexpected error | Pipeline fails |
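A custom pipeline step can branch on these codes explicitly. A sketch of a wrapper script; the simulated command below stands in for a real datacheck invocation so the example runs without datacheck installed:

```python
import subprocess
import sys

# Exit-code meanings from the table above.
EXIT_MEANINGS = {
    0: "All rules passed",
    1: "Error-severity failures",
    2: "Configuration error",
    3: "Data loading error",
    4: "Unexpected error",
}

def run_and_report(cmd: list[str]) -> int:
    """Run a command and print what its exit code means for the pipeline."""
    result = subprocess.run(cmd)
    meaning = EXIT_MEANINGS.get(result.returncode, "Unknown exit code")
    print(f"datacheck exited {result.returncode}: {meaning}")
    return result.returncode

# Simulate a configuration error (exit code 2):
code = run_and_report([sys.executable, "-c", "import sys; sys.exit(2)"])
# datacheck exited 2: Configuration error
```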
GitHub Actions
name: Data Quality Check
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install DataCheck
        run: pip install datacheck-cli
      - name: Validate Data
        run: datacheck validate --output results.json
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation-results
          path: results.json
GitLab CI
validate_data:
  image: python:3.12
  script:
    - pip install datacheck-cli
    - datacheck validate --output results.json
  artifacts:
    paths:
      - results.json
    when: always
Jenkins
pipeline {
    agent any
    stages {
        stage('Data Validation') {
            steps {
                sh 'pip install datacheck-cli'
                sh 'datacheck validate --output results.json'
            }
            post {
                always {
                    archiveArtifacts artifacts: 'results.json', allowEmptyArchive: true
                }
            }
        }
    }
}
Python API
ValidationEngine
from datacheck import ValidationEngine
engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()
print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")
for result in summary.get_failed_results():
    print(f" FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")
Constructor parameters:
| Parameter | Description |
|---|---|
| config / config_path | Configuration object or path to YAML file |
| parallel | Enable parallel execution (bool) |
| workers | Number of worker processes (int) |
| chunk_size | Rows per chunk for parallel execution (int) |
| show_progress | Show progress bar (bool) |
| notifier | Optional notifier instance (e.g. SlackNotifier) |
| sources_file | Path to sources YAML (overrides config) |
Methods:
| Method | Description |
|---|---|
| validate() | Validate using config defaults |
| validate_file(file_path, **kwargs) | Validate a file (supports sampling, delta time travel) |
| validate_sources(source_name, table, where, query, **kwargs) | Validate a named source |
| validate_dataframe(df) | Validate a pre-loaded pandas DataFrame |
ValidationSummary
| Property | Type | Description |
|---|---|---|
| total_rules | int | Total number of rules executed |
| passed_rules | int | Rules that passed |
| failed_rules | int | Rules that failed |
| failed_errors | int | Failed rules with error severity |
| failed_warnings | int | Failed rules with warning severity |
| failed_info | int | Failed rules with info severity |
| error_rules | int | Rules that encountered execution errors |
| all_passed | bool | Whether all rules passed |
| has_errors | bool | Whether any execution errors occurred |
| results | list | List of RuleResult objects |
| total_rows | int | Number of data rows |
| total_columns | int | Number of columns |
| timestamp | str | Execution timestamp |
| duration | float | Execution duration in milliseconds |
| trace_id | str | Unique run identifier for log correlation |
| Method | Returns | Description |
|---|---|---|
| get_passed_results() | list | RuleResults that passed |
| get_failed_results() | list | RuleResults that failed |
| get_error_results() | list | RuleResults with execution errors |
| to_dict() | dict | Serialize to dictionary |
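A serialized summary (via to_dict(), or a results.json artifact) can be post-processed in scripts. This sketch assumes the dictionary keys mirror the property names above; verify that assumption against real output:

```python
# Compute a rule pass rate from a serialized summary dictionary.
# Key names are assumed to mirror the ValidationSummary properties.
def pass_rate(summary: dict) -> float:
    total = summary.get("total_rules", 0)
    if total == 0:
        return 100.0  # no rules executed means nothing failed
    return 100.0 * summary["passed_rules"] / total

summary = {"total_rules": 40, "passed_rules": 38, "failed_rules": 2}
print(f"pass rate: {pass_rate(summary):.1f}%")  # pass rate: 95.0%
```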
RuleResult
| Property | Type | Description |
|---|---|---|
| rule_name | str | Rule identifier |
| column | str | Target column |
| passed | bool | Whether the rule passed |
| total_rows | int | Total rows checked |
| failed_rows | int | Rows that failed |
| rule_type | str | Rule type name |
| check_name | str | Check name from config |
| severity | str | error, warning, or info |
| failure_details | FailureDetail | Detailed failure information |
| error | str | Error message if rule errored |
| execution_time | float | Execution time in milliseconds |
DataProfiler
from datacheck.profiling import DataProfiler
profiler = DataProfiler(outlier_method="zscore")
profile = profiler.profile(df, name="orders")
Industry Templates
DataCheck ships with 8 config templates:
| Template | Use Case |
|---|---|
| basic | Generic starter config for any data |
| ecommerce | Order data, product catalogs, customer records |
| healthcare | Patient data, HIPAA compliance, date formats |
| finance | Transaction data, SOX compliance, sum validations |
| saas | User activity, subscription data, engagement metrics |
| iot | Sensor data, time-series, device telemetry |
| rules-reference | Complete reference of all validation rules with examples |
| sources | Data source connection templates with environment variable support |
datacheck config init --template ecommerce --with-sample-data
datacheck config init --template healthcare --with-sample-data --sample-rows 500
datacheck config templates # List all templates with descriptions
Error Handling
Exception hierarchy
| Exception | When |
|---|---|
| DataCheckError | Base exception for all DataCheck errors |
| ConfigurationError | Invalid config structure, missing required fields |
| ValidationError | Rule execution failures |
| DataLoadError | File not found, encoding issues, connection failures |
| RuleDefinitionError | Invalid rule parameters or missing required arguments |
| UnsupportedFormatError | Unknown file format or missing optional library |
| ColumnNotFoundError | Column not found in DataFrame |
| EmptyDatasetError | No rows in loaded dataset |
All exceptions inherit from DataCheckError, so you can catch them broadly:
from datacheck import ValidationEngine
from datacheck.exceptions import DataCheckError, ConfigurationError, DataLoadError

try:
    engine = ValidationEngine(config_path="config.yaml")
    summary = engine.validate()
except ConfigurationError as e:
    print(f"Config error: {e}")
except DataLoadError as e:
    print(f"Data load error: {e}")
except DataCheckError as e:
    print(f"DataCheck error: {e}")