DataCheck

Data validation engine for data engineers. Define validation rules in YAML, run checks on files, databases, and cloud warehouses from your terminal.

bash
pip install datacheck-cli

DataCheck provides the datacheck CLI and a Python API to validate data, profile quality, and detect schema changes. Run it locally during development, embed it in pipelines (Airflow, Dagster, Prefect), or integrate it into CI/CD workflows.


Installation

Requirements

  • Python 3.10, 3.11, or 3.12
  • pip 21.0 or greater

Install

bash
pip install datacheck-cli

Install with extras

Install only the connectors you need:

bash
# Databases
pip install datacheck-cli[postgresql]
pip install datacheck-cli[mysql]
pip install datacheck-cli[mssql]

# Cloud warehouses
pip install datacheck-cli[snowflake]
pip install datacheck-cli[bigquery]
pip install datacheck-cli[redshift]
pip install datacheck-cli[warehouses]     # All three warehouses

# Cloud storage
pip install datacheck-cli[cloud]          # S3, GCS, Azure Blob

# File formats
pip install datacheck-cli[deltalake]
pip install datacheck-cli[avro]
pip install datacheck-cli[duckdb]

# Statistical rules
pip install datacheck-cli[statistical]

# Everything
pip install datacheck-cli[all]

Verify

bash
datacheck version

Quickstart

1. Generate a config with sample data

bash
datacheck config init --with-sample-data

This creates a datacheck.yaml config file and a sample CSV file. Use --template to pick an industry template:

bash
datacheck config init --template ecommerce --with-sample-data

2. Run validation

bash
datacheck validate

DataCheck auto-discovers config files in this order: .datacheck.yaml → .datacheck.yml → datacheck.yaml → datacheck.yml. To specify a config explicitly:

bash
datacheck validate --config checks.yaml

3. Minimal config example

yaml
# .datacheck.yaml

data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

  - name: email_check
    column: email
    rules:
      email_valid: true

Configuration

Config file structure

A .datacheck.yaml file can contain:

yaml
# Data source (inline, for file-based sources)
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

# Or reference named sources
sources_file: sources.yaml
source: production_db
table: orders

# Validation checks
checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true
    severity: error        # error (default), warning, info
    enabled: true          # default: true

# Custom rule plugins
plugins:
  - ./custom_rules.py

# Config inheritance
extends: base.yaml

# Reporting
reporting:
  output_path: ./reports
  export_failures: true
  failures_file: failures.csv

# Notifications
notifications:
  slack_webhook: "${SLACK_WEBHOOK}"
  mention_on_failure: true

# Sampling
sampling:
  strategy: random
  params:
    sample_rate: 0.1

Checks definition

Each check targets a column and applies one or more rules:

yaml
checks:
  - name: order_amount         # Rule identifier
    column: amount             # Target column
    rules:
      not_null: true           # Rule type → parameters
      min: 0
      max: 100000
    severity: error            # error (default), warning, info
    enabled: true              # Toggle check on/off

  - name: warehouse_orders
    column: total
    source: snowflake_wh       # Override source for this check
    table: orders              # Override table for this check
    rules:
      min: 0

Severity levels

| Severity | Effect |
| --- | --- |
| error (default) | Causes exit code 1 on failure |
| warning | Reported but does not fail the run |
| info | Informational only |

Only error-severity failures cause a non-zero exit code.

Environment variables

Config files support environment variable substitution:

yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}                  # Required — fails if not set
    port: ${DB_PORT:-5432}            # Optional — uses default 5432
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

Use datacheck config env to list all variables referenced in a config and their current values:

bash
datacheck config env datacheck.yaml
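
The ${VAR} / ${VAR:-default} resolution described above can be sketched roughly as follows. This is an illustrative helper (resolve_env is not part of the DataCheck API), but it shows the two behaviors: a plain reference fails when the variable is unset, while the :- form falls back to its default:

```python
import os
import re

# ${VAR} (required) and ${VAR:-default} (optional) references.
_ENV_PATTERN = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)

def resolve_env(text: str) -> str:
    """Substitute environment variable references in a config string."""
    def replace(match: re.Match) -> str:
        name, default = match.group("name"), match.group("default")
        value = os.environ.get(name)
        if value is not None:
            return value
        if default is not None:
            return default
        raise KeyError(f"Required environment variable not set: {name}")
    return _ENV_PATTERN.sub(replace, text)

os.environ["DB_HOST"] = "db.example.com"
print(resolve_env("host: ${DB_HOST}, port: ${DB_PORT:-5432}"))
# host: db.example.com, port: 5432   (when DB_PORT is unset)
```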

Config inheritance

Use extends to inherit rules from a base config and override or add checks per environment:

yaml
# base.yaml — shared rules
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

yaml
# production.yaml — inherits base, adds stricter rules
extends: base.yaml

checks:
  - name: amount_check
    column: amount
    rules:
      min: 0
      max: 50000
    severity: error
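
The inheritance above boils down to a recursive merge. The sketch below pictures one plausible set of semantics (child scalars win, nested dicts merge, checks merge by name); DataCheck's exact merge rules may differ:

```python
# Illustrative extends-style merge; the merge-by-name handling of the
# checks list is an assumption, not DataCheck's documented algorithm.
def merge_configs(base: dict, child: dict) -> dict:
    merged = dict(base)
    for key, value in child.items():
        if key == "checks":
            # Child checks replace base checks with the same name; new names append.
            by_name = {c["name"]: c for c in base.get("checks", [])}
            by_name.update({c["name"]: c for c in value})
            merged["checks"] = list(by_name.values())
        elif isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = merge_configs(base[key], value)
        else:
            merged[key] = value
    merged.pop("extends", None)  # the directive itself is not part of the result
    return merged

base = {"data_source": {"type": "csv", "path": "./data/orders.csv"},
        "checks": [{"name": "id_check", "column": "id", "rules": {"not_null": True}}]}
prod = {"extends": "base.yaml",
        "checks": [{"name": "amount_check", "column": "amount", "rules": {"min": 0}}]}
merged = merge_configs(base, prod)
```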

Config validation

Check config for errors before running:

bash
datacheck config validate
datacheck config validate datacheck.yaml --strict   # Fail on warnings too

Auto-generate config from data

Analyze a data file and generate validation rules automatically:

bash
datacheck config generate data.csv
datacheck config generate data.csv --confidence high
datacheck config generate data.csv -o custom.yaml

Options:

| Flag | Description |
| --- | --- |
| --confidence / -c | Minimum confidence threshold: low, medium (default), high |
| --output / -o | Output config file path (default: datacheck.yaml) |
| --name / -n | Dataset name (default: derived from filename) |
| --force / -f | Overwrite existing config file |

The generated config includes:

  • Type inference: Correctly distinguishes int, numeric, bool, date, and string types
  • Regex patterns: Auto-detected patterns for IDs, URLs, dates, etc. using [0-9] character classes (not \d) for cross-language compatibility
  • Statistical rules: mean_between, std_dev_less_than, percentile_range with thresholds derived from data
  • Semantic rules: email_valid, phone_valid, url_valid, json_valid based on column name detection
  • Cross-column rules: sum_equals auto-detected when two numeric columns sum to a third
  • Temporal rules: timestamp_range with 1-day margin, no_future_timestamps, date_format with detected format string
  • Reporting block: Includes output_path and export_failures settings
  • Data source block: Includes file type, path, and options (delimiter, encoding, etc.)

Config validation error reporting

datacheck config validate reports all errors at once instead of stopping at the first one. This includes schema errors, missing fields (name, column, rules), and invalid rule definitions:

bash
datacheck config validate checks.yaml
# Configuration has errors:
#   - Check #2: Missing required field 'column'
#   - Check #5: Missing required field 'rules'
#   - Schema validation failed at 'checks.3.rules.min': -1 is not valid

Show resolved config

Display the fully resolved configuration with env vars and inheritance applied:

bash
datacheck config show
datacheck config show datacheck.yaml --format json
datacheck config show --no-resolve-env
datacheck config show --no-resolve-extends

Merge configs

Merge multiple configuration files. Later files override values from earlier files:

bash
datacheck config merge base.yaml production.yaml
datacheck config merge base.yaml prod.yaml -o merged.yaml

List templates

Show all available templates with descriptions:

bash
datacheck config templates

Data Sources

File sources (inline in config)

CSV

yaml
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

Parquet

yaml
data_source:
  type: parquet
  path: ./data/orders.parquet

Avro (requires pip install datacheck-cli[avro])

yaml
data_source:
  type: avro
  path: ./data/orders.avro

Delta Lake (requires pip install datacheck-cli[deltalake])

yaml
data_source:
  type: delta
  path: ./data/delta-table

Delta Lake supports time travel:

bash
datacheck validate --delta-version 5
datacheck validate --delta-timestamp "2026-01-15T10:00:00"
datacheck validate --storage-options '{"AWS_ACCESS_KEY_ID": "..."}'

SQLite

yaml
data_source:
  type: sqlite
  path: ./data/analytics.db

DuckDB (requires pip install datacheck-cli[duckdb])

yaml
data_source:
  type: duckdb
  path: ./data/analytics.duckdb

Database sources (named sources)

For databases, define named sources in a sources.yaml file:

yaml
# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}
    schema: public

  mysql_db:
    type: mysql
    host: ${MYSQL_HOST}
    port: ${MYSQL_PORT:-3306}
    database: ${MYSQL_DB}
    user: ${MYSQL_USER}
    password: ${MYSQL_PASSWORD}

  mssql_db:
    type: mssql
    host: ${MSSQL_HOST}
    port: ${MSSQL_PORT:-1433}
    database: ${MSSQL_DB}
    user: ${MSSQL_USER}
    password: ${MSSQL_PASSWORD}

Cloud warehouse sources

yaml
# sources.yaml
sources:
  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}
    role: ${SF_ROLE}
    # SSO: authenticator: externalbrowser
    # Key pair: private_key_path: /path/to/key.p8

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public
    # IAM auth: cluster_identifier, region, iam_auth: true

Snowflake, BigQuery, and Redshift support server-side filtering and sampling — WHERE clauses, LIMIT, and TABLESAMPLE execute on the warehouse to minimize data transfer before validation runs locally.

Cloud storage sources

yaml
# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

  gcs_data:
    type: gcs
    bucket: my-bucket
    path: data/orders.csv
    credentials_path: /path/to/service-account.json

  azure_data:
    type: azure
    container: my-container
    path: data/orders.csv
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    # Or: account_name + account_key

Connection strings

You can also pass connection strings directly to the CLI:

bash
datacheck validate postgresql://user:pass@host:5432/db --table orders
datacheck validate mysql://user:pass@host:3306/db --table orders
datacheck validate mssql://user:pass@host:1433/database --table orders
datacheck validate snowflake://account/database/schema --table orders
datacheck validate bigquery://project/dataset --table orders
datacheck validate redshift://user:pass@host:5439/database/schema --table orders

Named sources and per-check overrides

Reference a named source in your config:

yaml
# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders
    rules:
      min: 0

Switch sources at runtime:

bash
datacheck validate --source snowflake_wh --config checks.yaml
datacheck validate --source s3_data --sources-file sources.yaml

Connection pre-validation

When validating against database sources, DataCheck tests connectivity for all referenced sources before running any validation rules. If multiple sources are unreachable, all connection errors are reported together:

Source connectivity check failed:
  - Source 'production_db' (postgresql): Connection failed — could not connect to server
  - Source 'analytics_wh' (snowflake): Connection failed — invalid credentials

For file-based sources, DataCheck verifies the file exists before validation begins.

SQL filtering

Use --table, --where, and --query for server-side filtering:

bash
datacheck validate --source production_db --table orders --where "status = 'active'"
datacheck validate --source production_db --query "SELECT * FROM orders WHERE created_at > '2026-01-01'"

Validation Rules

Null and uniqueness

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| not_null | not_null: true | No null or missing values |
| unique | unique: true | No duplicate values (nulls ignored) |
| unique_combination | unique_combination: [col1, col2] | Composite uniqueness across columns |

Numeric

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| min | min: 0 | Column >= value |
| max | max: 10000 | Column <= value |
| mean_between | mean_between: {min: 10, max: 50} | Column mean within range |
| std_dev_less_than | std_dev_less_than: 5.0 | Standard deviation below threshold |
| percentile_range | percentile_range: {p25_min: 10, p25_max: 20, p75_min: 80, p75_max: 90} | 25th and 75th percentile bounds |
| z_score_outliers | z_score_outliers: 3.0 | Detect outliers by z-score (default threshold: 3.0) |
| distribution_type | distribution_type: 'normal' | Validate distribution shape — normal or uniform (uses KS test) |

String and pattern

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| regex | regex: '^[A-Z]{2}[0-9]{4}$' | Match regex pattern |
| allowed_values | allowed_values: [active, inactive, pending] | Value in allowed set |
| type | type: 'string' | Data type check (int, numeric, string, bool, date, datetime) |
| length | length: {min: 1, max: 100} | String length constraints |
| min_length | min_length: 1 | Minimum string length |
| max_length | max_length: 255 | Maximum string length |

Temporal

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| max_age | max_age: '24h' | Data freshness — supports h (hours), d (days), w (weeks), m (minutes) |
| timestamp_range | timestamp_range: {min: "2025-01-01", max: "2026-12-31"} | Timestamps within range (ISO format) |
| date_range | date_range: {min: "2025-01-01", max: "2026-12-31"} | Alias for timestamp_range |
| no_future_timestamps | no_future_timestamps: true | No timestamps beyond current time |
| date_format_valid | date_format_valid: '%Y-%m-%d' | Validates date format (Python strftime) |
| date_format | date_format: {format: '%Y-%m-%d'} | Alias for date_format_valid (dict form) |
| business_days_only | business_days_only: 'US' | Weekdays only — pass country code (e.g., 'US', 'GB') or true for default |

Semantic and format

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| email_valid | email_valid: true | RFC 5322 email format (two-stage: regex pre-filter + email-validator library) |
| phone_valid | phone_valid: 'US' | Phone number format (phonenumbers library, supports all countries; pass country code or true) |
| url_valid | url_valid: true | URL structure validation |
| json_valid | json_valid: true | Valid JSON parsing |
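
The two-stage pattern behind email_valid can be sketched like this. The real second stage delegates to the email-validator library; the simplified structural checks below are a stand-in so the sketch stays self-contained:

```python
import re

# Stage 1: cheap regex pre-filter that rejects obvious non-emails.
_PREFILTER = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_valid(value: str) -> bool:
    """Two-stage email check (simplified stand-in for the library-backed stage)."""
    if not _PREFILTER.match(value):          # stage 1: fast rejection
        return False
    local, _, domain = value.rpartition("@")
    if len(local) > 64 or len(value) > 254:  # stage 2: RFC length limits
        return False
    # Every domain label must be non-empty and not start with a hyphen.
    return all(label and not label.startswith("-") for label in domain.split("."))

print([email_valid(v) for v in ["a@example.com", "not-an-email", "a@b"]])
# [True, False, False]
```

The pre-filter keeps the common case fast; only strings that survive it pay for the stricter validation.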

Cross-column and relationships

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| unique_combination | unique_combination: [col1, col2] | Composite uniqueness across multiple columns |
| foreign_key_exists | Python API | Foreign key validation against a reference DataFrame (use Python API to pass live data) |
| sum_equals | sum_equals: {column_a: col1, column_b: col2} | Verify column equals sum of two other columns (with optional tolerance) |
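
To make the sum_equals semantics concrete, here is a hypothetical row-level re-implementation (DataCheck itself applies the rule to DataFrame columns; the tolerance parameter mirrors the optional tolerance mentioned above):

```python
import math

def sum_equals(rows, target, column_a, column_b, tolerance=1e-9):
    """One bool per row: True where row[target] == row[column_a] + row[column_b]."""
    return [math.isclose(r[target], r[column_a] + r[column_b], abs_tol=tolerance)
            for r in rows]

rows = [
    {"subtotal": 90.0, "tax": 10.0, "total": 100.0},
    {"subtotal": 50.0, "tax": 5.0, "total": 60.0},   # off by 5 -> invalid
]
print(sum_equals(rows, "total", "subtotal", "tax"))
# [True, False]
```

A tolerance is essential here: floating-point totals rarely compare exactly equal, so an absolute (or relative) epsilon avoids false failures.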

Example: complete config with rules

yaml
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_not_null
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_range
    column: amount
    rules:
      not_null: true
      min: 0
      max: 100000
      z_score_outliers:
        threshold: 3.0
    severity: error

  - name: email_format
    column: email
    rules:
      email_valid: true
    severity: warning

  - name: order_date
    column: created_at
    rules:
      no_future_timestamps: true
      max_age: '30d'
      date_format_valid: '%Y-%m-%d %H:%M:%S'

  - name: status_values
    column: status
    rules:
      allowed_values:
        - pending
        - confirmed
        - shipped
        - delivered
        - cancelled

Custom Rules

Creating custom rules

Create a Python file with functions decorated with @custom_rule. Each function receives a pd.Series and optional parameters, and returns a boolean pd.Series where True means valid:

python
# custom_rules.py
from datacheck.plugins.decorators import custom_rule
import pandas as pd

@custom_rule
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    """Validate that emails use approved business domains."""
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)

@custom_rule
def is_positive_margin(column: pd.Series, min_margin: float = 0.0) -> pd.Series:
    """Validate profit margin is above threshold."""
    return column.dropna() >= min_margin

Referencing plugins in config

yaml
plugins:
  - ./custom_rules.py

checks:
  - name: email_domain
    column: email
    rules:
      custom:
        rule: is_business_email
        params:
          allowed_domains: ["company.com", "corp.com"]

  - name: margin_check
    column: profit_margin
    rules:
      custom:
        rule: is_positive_margin
        params:
          min_margin: 0.05

Plugin registry

  • load_from_file() imports the Python module and registers all @custom_rule decorated functions
  • Registered rules become available through the RuleFactory alongside built-in rules
  • The global registry tracks all loaded custom rules

Data Profiling

Running profiling

bash
# Direct file path
datacheck profile data.csv

# Auto-discover config
datacheck profile

# Explicit config file
datacheck profile --config checks.yaml

# Named source
datacheck profile --source production_db --sources-file sources.yaml

# Named source with table
datacheck profile --source production_db --table orders

Profile options

| Flag | Description |
| --- | --- |
| --format / -f | Output format: terminal (default), json, markdown |
| --output / -o | Write output to file |
| --outlier-method | Outlier detection method: zscore (default) or iqr |
| --suggestions / --no-suggestions | Show rule suggestions (default: enabled) |
| --correlations / --no-correlations | Show correlation matrix |

bash
datacheck profile data.csv --format json -o profile.json
datacheck profile --outlier-method iqr --correlations
datacheck profile --format markdown -o report.md

What profiling computes

  • Basic counts: total rows, null count, unique count, duplicate count, completeness percentage
  • Numeric statistics: min, max, mean, median, standard deviation, 25th/50th/75th percentiles
  • Value distributions: top N values with counts
  • Outlier detection: Z-score method (|z| > 3.0) or IQR method (values outside Q1-1.5*IQR to Q3+1.5*IQR)
  • Correlation matrix: Pearson correlation between all numeric columns
  • Quality scoring: 0-100 score per column and per dataset
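
The two outlier methods listed above can be sketched with the standard library; the thresholds mirror the documented defaults (|z| > 3.0, fences at 1.5 × IQR):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose z-score magnitude exceeds the threshold."""
    mean, stdev = statistics.mean(values), statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

data = [10, 11, 9, 10, 12, 11, 10, 500]   # 500 is an obvious outlier
print(iqr_outliers(data))
# [500]
```

Note that on small samples an extreme value inflates the standard deviation enough to hide itself from the z-score test (masking), which is why the IQR method is offered as an alternative.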

Quality scoring

Each column receives a 0-100 quality score based on:

| Factor | What it measures |
| --- | --- |
| Completeness | Penalizes null/missing values |
| Uniqueness | Penalizes duplicate values |
| Validity | Type consistency across the column |
| Consistency | Low variance in categorical columns |

The dataset score is a weighted average of all column scores.
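
A toy version of the per-column score, using only the completeness and uniqueness factors, looks like this. The 50/50 weighting is an illustrative assumption, not DataCheck's actual formula:

```python
# Illustrative weights; DataCheck's real scoring also folds in
# validity and consistency factors.
WEIGHTS = {"completeness": 0.5, "uniqueness": 0.5}

def column_quality(values):
    """0-100 quality score from completeness and uniqueness alone."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / total if total else 0.0
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    score = 100 * (WEIGHTS["completeness"] * completeness
                   + WEIGHTS["uniqueness"] * uniqueness)
    return round(score, 1)

print(column_quality([1, 2, 3, 4, None]))   # 80% complete, 100% unique
# 90.0
```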

Rule suggestions

The profiler automatically suggests validation rules based on data patterns:

  • Numeric columns: range rules, outlier thresholds, distribution checks, type (int vs numeric)
  • String columns: length constraints, regex patterns, allowed value sets
  • Temporal columns: date format detection, timestamp ranges (with margin), no_future_timestamps
  • Semantic columns: email_valid, phone_valid, url_valid, json_valid inferred from column names and content
  • Cross-column: sum_equals auto-detected when two numeric columns sum to a third
  • All columns: null checks, uniqueness rules

Schema Detection and Evolution

Commands

bash
datacheck schema capture              # Save current schema as baseline
datacheck schema compare              # Compare current data against baseline
datacheck schema show                 # Display detected schema
datacheck schema list                 # List all saved baselines
datacheck schema history              # View capture history

Schema capture

bash
datacheck schema capture data.csv
datacheck schema capture --source production_db --sources-file sources.yaml
datacheck schema capture --name v2-baseline
datacheck schema capture --baseline-dir ./schemas
datacheck schema capture --no-history

| Flag | Description |
| --- | --- |
| --name / -n | Baseline name (default: baseline) |
| --baseline-dir | Storage directory (default: .datacheck/schemas/) |
| --save-history / --no-history | Save to history (default: enabled) |

Schema compare

bash
datacheck schema compare data.csv
datacheck schema compare --baseline v2-baseline
datacheck schema compare --fail-on-breaking
datacheck schema compare --rename-threshold 0.9
datacheck schema compare --format json

| Flag | Description |
| --- | --- |
| --baseline / -b | Baseline name to compare against (default: baseline) |
| --rename-threshold | Similarity threshold for rename detection (0.0-1.0, default: 0.8) |
| --fail-on-breaking | Exit with code 1 on breaking changes |
| --format / -f | Output format: terminal (default) or json |
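
Threshold-based rename detection can be pictured as pairing each removed column with the most similar added column. The matching heuristic below (difflib string similarity, default threshold 0.8 as documented) is an assumption, not DataCheck's exact logic:

```python
from difflib import SequenceMatcher

def detect_renames(removed, added, threshold=0.8):
    """Pair removed/added column names whose similarity clears the threshold."""
    renames = []
    for old in removed:
        best = max(added,
                   key=lambda new: SequenceMatcher(None, old, new).ratio(),
                   default=None)
        if best and SequenceMatcher(None, old, best).ratio() >= threshold:
            renames.append((old, best))
    return renames

print(detect_renames(["customer_id"], ["customer_ID", "order_total"]))
# [('customer_id', 'customer_ID')]
```

Raising --rename-threshold toward 1.0 makes the comparison stricter, so near-matches are reported as a removal plus an addition instead of a rename.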

Schema compare exit codes

| Code | Meaning |
| --- | --- |
| 0 | Compatible — no breaking changes |
| 1 | Breaking changes detected (with --fail-on-breaking) |
| 2 | Baseline not found |
| 3 | Data load error |
| 4 | Unexpected error |

What schema tracks

For each column: name, data type, nullable status, position, unique value count, null percentage. For the dataset: row count, source identifier, capture timestamp.

Change types detected

| Change | Compatibility Level |
| --- | --- |
| Column added | COMPATIBLE |
| Column removed | BREAKING |
| Column renamed | WARNING |
| Nullable changed | WARNING |
| Order changed | COMPATIBLE |

Type change compatibility

Compatible changes (widening): int→float, int→string, float→string, bool→string, date→datetime, date→string, datetime→string

Breaking changes (narrowing): float→int, string→int, string→float, string→bool, datetime→date, string→datetime, string→date

Baseline storage

  • Baselines are stored as JSON files in .datacheck/schemas/
  • History entries are stored in .datacheck/schemas/history/ with timestamps (e.g. schema_20260212_143000.json)
  • Use datacheck schema list to see all baselines
  • Use datacheck schema history --limit 20 to see recent history

Sampling Strategies

Available strategies

| Strategy | Description | Key Parameters |
| --- | --- | --- |
| random | Simple random sampling | sample_rate or sample_count, seed |
| stratified | Preserve value distributions across groups | stratify_column, min_per_stratum |
| time_based | Sample within a time window | time_column, start_date, end_date |
| error_focused | Prioritize rows matching error conditions | error_conditions (e.g. ['age<0', 'price>10000']) |
| adaptive | Adjust sample size based on data characteristics | target_quality, initial_size |
| reservoir | Single-pass sampling for streaming data | sample_count |
| systematic | Every Nth row | sample_rate |
| top_n | First N rows | --top N |
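
The reservoir strategy is the classic Algorithm R: it maintains a uniform fixed-size sample in one pass, without knowing the stream length up front. A minimal sketch (not DataCheck's implementation):

```python
import random

def reservoir_sample(stream, sample_count, seed=None):
    """Uniform sample of sample_count items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(stream):
        if i < sample_count:
            reservoir.append(row)           # fill the reservoir first
        else:
            j = rng.randint(0, i)           # replace with decreasing probability
            if j < sample_count:
                reservoir[j] = row
    return reservoir

sample = reservoir_sample(range(100_000), sample_count=1000, seed=42)
print(len(sample))
# 1000
```

Passing a seed makes the sample reproducible across runs, which is what the --seed flag provides for the CLI strategies.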

CLI sampling flags

bash
# Random sampling
datacheck validate --sample-rate 0.1              # 10% of rows
datacheck validate --sample-count 1000            # Exactly 1000 rows
datacheck validate --sample-count 1000 --seed 42  # Reproducible

# First N rows
datacheck validate --top 500

# Strategy-based
datacheck validate --sample-strategy stratified --stratify region
datacheck validate --sample-strategy time_based --time-column created_at --start-date 2026-01-01 --end-date 2026-02-01
datacheck validate --sample-strategy error_focused --error-indicators "age<0,price>10000"

| Flag | Description |
| --- | --- |
| --sample-rate | Fraction to sample (0.0-1.0) |
| --sample-count | Exact number of rows to sample |
| --top | First N rows only |
| --sample-strategy | Strategy name: random, stratified, time_based, error_focused, adaptive, reservoir |
| --stratify | Column for stratified sampling |
| --seed | Random seed for reproducibility |
| --time-column | Column for time-based sampling |
| --start-date | Start date (ISO format) |
| --end-date | End date (ISO format) |
| --error-indicators | Comma-separated error conditions |

CLI Command Reference

datacheck validate

Run validation against data files or databases.

Data source flags:

| Flag | Description |
| --- | --- |
| data_source (positional) | File path or connection string |
| --config / -c | Path to validation config YAML |
| --source | Named source from sources.yaml |
| --sources-file | Path to sources YAML file |
| --table / -t | Database table name |
| --where / -w | SQL WHERE clause for filtering |
| --query / -q | Custom SQL query |
| --schema / -s | Schema/dataset name |

Warehouse-specific flags:

| Flag | Description |
| --- | --- |
| --warehouse | Snowflake warehouse name |
| --credentials | Path to credentials file (BigQuery service account) |
| --region | AWS region (Redshift IAM auth) |
| --cluster | Cluster identifier (Redshift IAM auth) |
| --iam-auth | Use IAM authentication (Redshift) |

Delta Lake flags:

| Flag | Description |
| --- | --- |
| --delta-version | Delta Lake version to load (time travel) |
| --delta-timestamp | Timestamp to load data as of (ISO 8601) |
| --storage-options | JSON string of storage options for cloud access |

Sampling flags: See Sampling Strategies.

Execution flags:

| Flag | Description |
| --- | --- |
| --parallel | Enable multi-core parallel execution |
| --workers | Number of worker processes (default: CPU count) |
| --chunk-size | Rows per chunk for parallel processing (default: 10,000) |
| --progress / --no-progress | Show/hide progress bar |

Output flags:

| Flag | Description |
| --- | --- |
| --output / -o | Save results to a JSON file |
| --csv-export | Export failure details as CSV |
| --suggestions / --no-suggestions | Show improvement suggestions (default: enabled) |
| --slack-webhook | Slack webhook URL for notifications |

Logging flags:

| Flag | Description |
| --- | --- |
| --log-level | Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-format | Log format: console (human-readable) or json (machine-parseable) |
| --log-file | Path to log file (with automatic rotation) |
| --verbose / -v | Shortcut for --log-level DEBUG |

datacheck profile

Generate data quality profiles with statistics, quality scores, and rule suggestions.

Same data source flags as validate, plus:

| Flag | Description |
| --- | --- |
| --format / -f | Output format: terminal (default), json, markdown |
| --output / -o | Write output to file |
| --outlier-method | Detection method: zscore (default) or iqr |
| --suggestions / --no-suggestions | Show rule suggestions |
| --correlations / --no-correlations | Show correlation matrix |

datacheck config

Configuration management commands.

| Subcommand | Description |
| --- | --- |
| config init | Generate config from template |
| config init --template <name> | Use specific template (basic, ecommerce, healthcare, finance, saas, iot, rules-reference, sources) |
| config init --with-sample-data | Also generate a sample CSV file |
| config init --sample-rows N | Number of sample rows to generate (default: 100) |
| config init --force | Overwrite existing config file |
| config validate <file> | Validate config file syntax and rule definitions |
| config validate --strict | Fail on warnings too |
| config show <file> | Show fully resolved config (env vars + inheritance applied) |
| config show --format yaml/json | Output format |
| config show --no-resolve-env | Skip environment variable resolution |
| config show --no-resolve-extends | Skip config inheritance resolution |
| config merge <files...> | Merge multiple configs (later files override earlier) |
| config merge -o output.yaml | Write merged result to file |
| config generate <file> | Auto-generate rules from data analysis |
| config generate --confidence | Minimum confidence: low, medium (default), high |
| config templates | List available templates with descriptions |
| config env <file> | Show environment variables referenced in config |

datacheck schema

Schema evolution detection and management.

| Subcommand | Description |
| --- | --- |
| schema capture | Save current schema as baseline |
| schema compare | Compare current data against baseline |
| schema show | Display detected schema (columns, types, nullable, stats) |
| schema list | List all saved baseline schemas |
| schema history | View capture history (newest first) |

datacheck version

Display version information.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All rules passed (or only warning/info severity failures) |
| 1 | Some error-severity rules failed |
| 2 | Configuration error |
| 3 | Data loading error |
| 4 | Unexpected error |

Output and Reporting

Terminal output

DataCheck uses Rich-formatted terminal output with color-coded results:

  • Green: Passed rules
  • Red: Failed rules
  • Yellow: Errors during rule execution

Output includes a statistics table (records, columns, rules, pass/fail counts), detailed failure tables (check name, column, failure count, sample values), and actionable improvement suggestions.

JSON export

bash
datacheck validate --output results.json

Exports full validation results in machine-readable JSON format, including all rule results, failure details, and summary statistics. Use this for automation and CI/CD integration.

CSV export

bash
datacheck validate --csv-export failures.csv

Exports failure details as CSV with columns: check_name, column, severity, failed_rows, reason, suggestion.

Markdown reports

bash
datacheck profile --format markdown -o report.md

Generates markdown-formatted profile reports with tables, statistics, and quality scores.

Slack notifications

Configure the webhook in your config file so you don't need to pass it every time:

yaml
notifications:
  slack_webhook: "${SLACK_WEBHOOK}"
  mention_on_failure: true    # @channel on failures (default: false)

Or pass it via the CLI (overrides the config value):

bash
datacheck validate --slack-webhook https://hooks.slack.com/services/...

Sends validation results to Slack with:

  • Color-coded messages (green for pass, red for fail)
  • Summary statistics and failed rules
  • Optional @channel mention on failures (via mention_on_failure)
  • Up to 5 failed rule details with row counts

Parallel Execution and Performance

Enabling parallel mode

bash
datacheck validate --parallel
datacheck validate --parallel --workers 4
datacheck validate --parallel --chunk-size 50000
datacheck validate --parallel --progress

| Flag | Description |
| --- | --- |
| --parallel | Enable multi-core parallel execution |
| --workers | Number of worker processes (default: CPU count) |
| --chunk-size | Rows per chunk (default: 10,000) |
| --progress / --no-progress | Show/hide progress bar |

How parallel execution works

  1. Splits the DataFrame into chunks based on --chunk-size
  2. Processes chunks in parallel using multiprocessing.Pool
  3. Aggregates results across chunks (combines pass/fail counts, merges failure details)
  4. Automatically falls back to sequential execution for small datasets
  5. Shows a Rich progress bar with spinner, elapsed time, and remaining time
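
Steps 1-3 above can be sketched as follows. DataCheck runs the per-chunk work through multiprocessing.Pool; a plain map stands in here so the sketch stays self-contained, since the splitting and aggregation logic is the interesting part (validate_chunk is a hypothetical stand-in for a real rule):

```python
def split_chunks(rows, chunk_size):
    """Step 1: split the data into fixed-size chunks."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def validate_chunk(chunk, min_value=0):
    """Step 2: validate one chunk (toy min-value rule) and report counts."""
    failures = [v for v in chunk if v < min_value]
    return {"passed": len(chunk) - len(failures),
            "failed": len(failures),
            "samples": failures[:5]}

def aggregate(results):
    """Step 3: combine per-chunk counts and merge failure samples."""
    results = list(results)   # accept any iterable, including map()
    return {"passed": sum(r["passed"] for r in results),
            "failed": sum(r["failed"] for r in results),
            "samples": [s for r in results for s in r["samples"]][:5]}

rows = [5, -1, 3, -2, 8, 9, -4, 1]
report = aggregate(map(validate_chunk, split_chunks(rows, chunk_size=3)))
print(report)
# {'passed': 5, 'failed': 3, 'samples': [-1, -2, -4]}
```

Because each chunk is validated independently, swapping the map for a process pool parallelizes the work without changing the aggregation.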

Performance features

  • PyArrow backend: Vectorized operations for faster validation (e.g. fast null count via Arrow)
  • Lazy loading: Cloud connectors are loaded only when needed — no unnecessary dependencies
  • Memory optimization: Memory-aware chunk sizing, worker auto-scaling, and large file handling
  • Caching: Regex compilation caching (@lru_cache) and compute-once patterns for expensive operations
  • Vectorized rules: NumPy/Pandas vectorized operations — no Python loops in hot paths

Logging

Log configuration

bash
datacheck validate --verbose                        # DEBUG level
datacheck validate --log-level WARNING               # Specific level
datacheck validate --log-format json                 # Machine-parseable JSON logs
datacheck validate --log-file validation.log         # Log to file (with rotation)
datacheck validate --log-level DEBUG --log-format json --log-file debug.log

| Flag | Description |
| --- | --- |
| --log-level | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-format | console (human-readable, default) or json (machine-parseable) |
| --log-file | Path to log file (automatic rotation) |
| --verbose / -v | Shortcut for --log-level DEBUG |

Logging features

  • Structured logging: Console and JSON formatters for different use cases
  • Sensitive data masking: Automatically masks credentials and passwords in log output
  • Trace IDs: Unique trace ID per validation run for log correlation across systems
  • File rotation: Automatic log file rotation to prevent unbounded growth
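
Sensitive-data masking typically works as a substitution pass over each log line before it is emitted. The sketch below illustrates the idea; the list of sensitive keys is an assumption, not DataCheck's exact filter:

```python
import re

# Match key=value or key: value pairs for common credential keys.
_SENSITIVE = re.compile(
    r"(password|secret|token|api_key)(\s*[=:]\s*)(\S+)", re.IGNORECASE
)

def mask_secrets(message: str) -> str:
    """Replace credential values with *** while keeping the key visible."""
    return _SENSITIVE.sub(lambda m: f"{m.group(1)}{m.group(2)}***", message)

print(mask_secrets("connecting with password=hunter2 host=db.example.com"))
# connecting with password=*** host=db.example.com
```

In practice this runs inside a logging filter or formatter, so every record is scrubbed regardless of which code path emitted it.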

Security

Credential handling

  • Environment variables: Use ${VAR} and ${VAR:-default} syntax in config files — never hardcode credentials
  • Credential files: Load credentials from external files
  • Password masking: Credentials are automatically masked in logs and terminal output
  • Config env audit: Use datacheck config env to verify all required variables are set

Connection security

  • Connection string validation before attempting connections
  • SQL injection prevention: table name validation, WHERE clause scanning, parameterized queries
  • Path traversal prevention with null byte and symlink detection
  • SSL/TLS enforcement for warehouse connections
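
Table-name validation typically means an identifier allowlist: a name is interpolated into SQL only if it matches a strict pattern, so injection payloads are rejected outright. A hedged sketch of the idea (hypothetical pattern, not DataCheck's exact rules):

```python
import re

# Conservative allowlist: letters, digits, underscores,
# optionally schema-qualified (schema.table). Anything else is rejected.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")

def validate_table_name(name: str) -> str:
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not _TABLE_NAME.match(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name

assert validate_table_name("public.orders") == "public.orders"
try:
    validate_table_name("orders; DROP TABLE users--")
except ValueError:
    pass  # injection attempt rejected
```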

Airflow Integration

DataCheck provides two Airflow operators for use in DAGs, plus a simpler BashOperator pattern.

DataCheckOperator

Run data validation inside Airflow DAGs:

python
from datacheck.airflow.operators import DataCheckOperator

validate_orders = DataCheckOperator(
    task_id="validate_orders",
    config_path="/path/to/datacheck.yaml",
    file_path="/data/orders.csv",
    fail_on_error=True,
    push_results=True,
    min_pass_rate=95.0,
)

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str | required | Path to validation config YAML |
| file_path | str | None | Path to data file (CSV, Parquet, Avro, Delta, etc.) |
| sources_file | str | None | Path to sources YAML (overrides config) |
| source_name | str | None | Named source from sources.yaml |
| table | str | None | Database table name |
| where | str | None | SQL WHERE clause |
| query | str | None | Custom SQL query |
| sample_rate | float | None | Random sample fraction (0.0-1.0) |
| parallel | bool | False | Enable multi-core validation |
| workers | int | None | Number of worker processes |
| min_pass_rate | float | 0 | Minimum rule pass rate (0-100, 0 = disabled) |
| min_quality_score | float | 0 | Minimum quality score (0-100, 0 = disabled) |
| fail_on_error | bool | True | Fail Airflow task on validation failure |
| push_results | bool | True | Push results to XCom |

Template fields: config_path, file_path, sources_file, source_name, table, where, query (supports .yaml and .yml extensions)

XCom output:

  • validation_results: Full results dictionary
  • passed: Boolean pass/fail result
  • pass_rate: Percentage of rules passed

Data source resolution order:

  1. file_path — file-based validation
  2. source_name + sources_file — named source validation
  3. Config default (source or data_source from config)
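
The resolution order above can be sketched as a small precedence function (hypothetical helper, not the operator's actual code):

```python
from typing import Optional

def resolve_source(
    file_path: Optional[str] = None,
    source_name: Optional[str] = None,
    sources_file: Optional[str] = None,
    config_default: Optional[str] = None,
) -> str:
    """Mirror the documented order: file, then named source, then config default."""
    if file_path:
        return f"file:{file_path}"
    if source_name and sources_file:
        return f"source:{source_name}@{sources_file}"
    if config_default:
        return f"config:{config_default}"
    raise ValueError("no data source could be resolved")

# file_path wins even when a named source is also supplied
assert resolve_source(
    file_path="/data/orders.csv",
    source_name="prod_db",
    sources_file="sources.yaml",
) == "file:/data/orders.csv"
```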

DataCheckSchemaOperator

Detect schema changes inside Airflow DAGs:

python
from datacheck.airflow.operators import DataCheckSchemaOperator

check_schema = DataCheckSchemaOperator(
    task_id="check_schema",
    config_path="/path/to/datacheck.yaml",
    file_path="/data/orders.csv",
    baseline_name="orders-v2",
    fail_on_breaking=True,
    push_results=True,
)

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str | required | Path to validation config YAML |
| file_path | str | None | Path to data file |
| sources_file | str | None | Path to sources YAML |
| source_name | str | None | Named source from sources.yaml |
| table | str | None | Database table name |
| baseline_name | str | "baseline" | Baseline identifier |
| baseline_dir | str | ".datacheck/schemas" | Baseline storage directory |
| fail_on_breaking | bool | True | Fail Airflow task on breaking schema changes |
| push_results | bool | True | Push results to XCom |

XCom output:

  • schema_results: Schema comparison results dictionary
  • schema_compatible: Boolean compatibility flag

Auto-captures a new baseline if none exists yet.

BashOperator pattern

For simpler integration, use Airflow's BashOperator directly:

python
from airflow.operators.bash import BashOperator

validate = BashOperator(
    task_id="validate_data",
    bash_command="datacheck validate --config /path/to/config.yaml --output /tmp/results.json",
)

Exit codes work directly with Airflow task status — exit code 0 means success, any non-zero code fails the task.


CI/CD Integration

DataCheck uses standard exit codes for automation. Any non-zero exit code fails the pipeline.

| Code | Meaning | CI/CD Effect |
|---|---|---|
| 0 | All rules passed | Pipeline continues |
| 1 | Error-severity failures | Pipeline fails (blocks deploy) |
| 2 | Configuration error | Pipeline fails |
| 3 | Data loading error | Pipeline fails |
| 4 | Unexpected error | Pipeline fails |
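
Scripts that wrap the CLI can branch on these codes instead of just pass/fail. A hedged sketch of that pattern (the mapping mirrors the table above; the wrapper function is hypothetical):

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "All rules passed",
    1: "Error-severity failures",
    2: "Configuration error",
    3: "Data loading error",
    4: "Unexpected error",
}

def run_and_explain(cmd: list[str]) -> tuple[int, str]:
    """Run a command and translate its exit code into DataCheck's meaning."""
    result = subprocess.run(cmd)
    return result.returncode, EXIT_MEANINGS.get(result.returncode, "Unknown exit code")

# In CI you would call something like:
#   run_and_explain(["datacheck", "validate", "--output", "results.json"])
# Demonstrated here with a stub process that exits with code 2:
code, meaning = run_and_explain([sys.executable, "-c", "import sys; sys.exit(2)"])
assert (code, meaning) == (2, "Configuration error")
```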

GitHub Actions

yaml
name: Data Quality Check
on: [push]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install DataCheck
        run: pip install datacheck-cli
      - name: Validate Data
        run: datacheck validate --output results.json
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation-results
          path: results.json

GitLab CI

yaml
validate_data:
  image: python:3.12
  script:
    - pip install datacheck-cli
    - datacheck validate --output results.json
  artifacts:
    paths:
      - results.json
    when: always

Jenkins

groovy
pipeline {
    agent any
    stages {
        stage('Data Validation') {
            steps {
                sh 'pip install datacheck-cli'
                sh 'datacheck validate --output results.json'
            }
            post {
                always {
                    archiveArtifacts artifacts: 'results.json', allowEmptyArchive: true
                }
            }
        }
    }
}

Python API

ValidationEngine

python
from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

Constructor parameters:

| Parameter | Description |
|---|---|
| config / config_path | Configuration object or path to YAML file |
| parallel | Enable parallel execution (bool) |
| workers | Number of worker processes (int) |
| chunk_size | Rows per chunk for parallel execution (int) |
| show_progress | Show progress bar (bool) |
| notifier | Optional notifier instance (e.g. SlackNotifier) |
| sources_file | Path to sources YAML (overrides config) |

Methods:

| Method | Description |
|---|---|
| validate() | Validate using config defaults |
| validate_file(file_path, **kwargs) | Validate a file (supports sampling, delta time travel) |
| validate_sources(source_name, table, where, query, **kwargs) | Validate a named source |
| validate_dataframe(df) | Validate a pre-loaded pandas DataFrame |

ValidationSummary

| Property | Type | Description |
|---|---|---|
| total_rules | int | Total number of rules executed |
| passed_rules | int | Rules that passed |
| failed_rules | int | Rules that failed |
| failed_errors | int | Failed rules with error severity |
| failed_warnings | int | Failed rules with warning severity |
| failed_info | int | Failed rules with info severity |
| error_rules | int | Rules that encountered execution errors |
| all_passed | bool | Whether all rules passed |
| has_errors | bool | Whether any execution errors occurred |
| results | list | List of RuleResult objects |
| total_rows | int | Number of data rows |
| total_columns | int | Number of columns |
| timestamp | str | Execution timestamp |
| duration | float | Execution duration in milliseconds |
| trace_id | str | Unique run identifier for log correlation |

| Method | Returns | Description |
|---|---|---|
| get_passed_results() | list | RuleResults that passed |
| get_failed_results() | list | RuleResults that failed |
| get_error_results() | list | RuleResults with execution errors |
| to_dict() | dict | Serialize to dictionary |
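
These counters are what a quality gate (like the operators' min_pass_rate check) works from. A sketch of such a gate using a stand-in dataclass that mirrors the fields above (not the real ValidationSummary class):

```python
from dataclasses import dataclass

@dataclass
class SummaryStub:
    """Stand-in mirroring the ValidationSummary fields a gate needs."""
    total_rules: int
    passed_rules: int
    failed_errors: int

def passes_gate(summary: SummaryStub, min_pass_rate: float = 95.0) -> bool:
    """Fail on any error-severity failure, or on a pass rate below the threshold."""
    if summary.failed_errors > 0:
        return False
    pass_rate = 100.0 * summary.passed_rules / summary.total_rules
    return pass_rate >= min_pass_rate

# 19/20 rules passed, none at error severity: 95.0% meets the default threshold
assert passes_gate(SummaryStub(total_rules=20, passed_rules=19, failed_errors=0))
# A single error-severity failure fails the gate regardless of pass rate
assert not passes_gate(SummaryStub(total_rules=20, passed_rules=19, failed_errors=1))
```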

RuleResult

| Property | Type | Description |
|---|---|---|
| rule_name | str | Rule identifier |
| column | str | Target column |
| passed | bool | Whether the rule passed |
| total_rows | int | Total rows checked |
| failed_rows | int | Rows that failed |
| rule_type | str | Rule type name |
| check_name | str | Check name from config |
| severity | str | error, warning, or info |
| failure_details | FailureDetail | Detailed failure information |
| error | str | Error message if rule errored |
| execution_time | float | Execution time in milliseconds |

DataProfiler

python
from datacheck.profiling import DataProfiler

profiler = DataProfiler(outlier_method="zscore")
profile = profiler.profile(df, name="orders")  # df: a pre-loaded pandas DataFrame

Industry Templates

DataCheck ships with 8 config templates:

| Template | Use Case |
|---|---|
| basic | Generic starter config for any data |
| ecommerce | Order data, product catalogs, customer records |
| healthcare | Patient data, HIPAA compliance, date formats |
| finance | Transaction data, SOX compliance, sum validations |
| saas | User activity, subscription data, engagement metrics |
| iot | Sensor data, time-series, device telemetry |
| rules-reference | Complete reference of all validation rules with examples |
| sources | Data source connection templates with environment variable support |

bash
datacheck config init --template ecommerce --with-sample-data
datacheck config init --template healthcare --with-sample-data --sample-rows 500
datacheck config templates   # List all templates with descriptions

Error Handling

Exception hierarchy

| Exception | When |
|---|---|
| DataCheckError | Base exception for all DataCheck errors |
| ConfigurationError | Invalid config structure, missing required fields |
| ValidationError | Rule execution failures |
| DataLoadError | File not found, encoding issues, connection failures |
| RuleDefinitionError | Invalid rule parameters or missing required arguments |
| UnsupportedFormatError | Unknown file format or missing optional library |
| ColumnNotFoundError | Column not found in DataFrame |
| EmptyDatasetError | No rows in loaded dataset |

All exceptions inherit from DataCheckError, so you can catch them broadly:

python
from datacheck import ValidationEngine
from datacheck.exceptions import DataCheckError, ConfigurationError, DataLoadError

try:
    engine = ValidationEngine(config_path="config.yaml")
    summary = engine.validate()
except ConfigurationError as e:
    print(f"Config error: {e}")
except DataLoadError as e:
    print(f"Data load error: {e}")
except DataCheckError as e:
    print(f"DataCheck error: {e}")