DataCheck

Data validation engine for data engineers. Define validation rules in YAML, run checks on files, databases, and cloud warehouses from your terminal.

bash
pip install datacheck-cli

DataCheck provides the datacheck CLI and a Python API to validate data, profile quality, and detect schema changes. Run it locally during development, embed it in pipelines (Airflow, Dagster, Prefect), or integrate it into CI/CD workflows.


Installation

Requirements

  • Python 3.10, 3.11, or 3.12
  • pip 21.0 or greater

Install

bash
pip install datacheck-cli

Install with extras

Install only the connectors you need:

bash
# Databases
pip install datacheck-cli[postgresql]
pip install datacheck-cli[mysql]
pip install datacheck-cli[mssql]

# Cloud warehouses
pip install datacheck-cli[snowflake]
pip install datacheck-cli[bigquery]
pip install datacheck-cli[redshift]
pip install datacheck-cli[warehouses]     # All three warehouses

# Cloud storage
pip install datacheck-cli[cloud]          # S3, GCS, Azure Blob

# File formats
pip install datacheck-cli[deltalake]
pip install datacheck-cli[avro]
pip install datacheck-cli[duckdb]

# Statistical rules
pip install datacheck-cli[statistical]

# Everything
pip install datacheck-cli[all]

Verify

bash
datacheck version

Quickstart

1. Generate a config with sample data

bash
datacheck config init --with-sample-data

This creates a datacheck.yaml config file and a sample CSV file. Use --template to pick an industry template:

bash
datacheck config init --template ecommerce --with-sample-data

2. Run validation

bash
datacheck validate

DataCheck auto-discovers config files in this order: .datacheck.yaml → .datacheck.yml → datacheck.yaml → datacheck.yml. To specify a config explicitly:

bash
datacheck validate --config checks.yaml

3. Minimal config example

yaml
# .datacheck.yaml

data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

  - name: email_check
    column: email
    rules:
      email_valid: true

Configuration

Config file structure

A .datacheck.yaml file can contain:

yaml
# Data source (inline, for file-based sources)
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

# Or reference named sources
sources_file: sources.yaml
source: production_db
table: orders

# Validation checks
checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true
    severity: error        # error (default), warning, info
    enabled: true          # default: true

# Custom rule plugins
plugins:
  - ./custom_rules.py

# Config inheritance
extends: base.yaml

# Reporting
reporting:
  output_path: ./reports
  export_failures: true
  failures_file: failures.csv

# Notifications
notifications:
  slack_webhook: "${SLACK_WEBHOOK}"
  mention_on_failure: true

# Sampling
sampling:
  strategy: random
  params:
    sample_rate: 0.1

Checks definition

Each check targets a column and applies one or more rules:

yaml
checks:
  - name: order_amount         # Rule identifier
    column: amount             # Target column
    rules:
      not_null: true           # Rule type → parameters
      min: 0
      max: 100000
    severity: error            # error (default), warning, info
    enabled: true              # Toggle check on/off

  - name: warehouse_orders
    column: total
    source: snowflake_wh       # Override source for this check
    table: orders              # Override table for this check
    rules:
      min: 0

Severity levels

| Severity | Effect |
| --- | --- |
| error (default) | Causes exit code 1 on failure |
| warning | Reported but does not fail the run |
| info | Informational only |

Only error-severity failures cause a non-zero exit code.

Environment variables

Config files support environment variable substitution:

yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}                  # Required — fails if not set
    port: ${DB_PORT:-5432}            # Optional — uses default 5432
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

Use datacheck config env to list all variables referenced in a config and their current values:

bash
datacheck config env datacheck.yaml
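
The ${VAR} / ${VAR:-default} resolution described above can be sketched roughly as follows. This is an illustrative helper (resolve_env is not part of the DataCheck API), but it shows the two behaviors: a plain reference fails when the variable is unset, while the :- form falls back to its default:

```python
import os
import re

# ${VAR} (required) and ${VAR:-default} (optional) references.
_ENV_PATTERN = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)

def resolve_env(text: str) -> str:
    """Substitute environment variable references in a config string."""
    def replace(match: re.Match) -> str:
        name, default = match.group("name"), match.group("default")
        value = os.environ.get(name)
        if value is not None:
            return value
        if default is not None:
            return default
        raise KeyError(f"Required environment variable not set: {name}")
    return _ENV_PATTERN.sub(replace, text)

os.environ["DB_HOST"] = "db.example.com"
print(resolve_env("host: ${DB_HOST}, port: ${DB_PORT:-5432}"))
# host: db.example.com, port: 5432   (when DB_PORT is unset)
```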

Config inheritance

Use extends to inherit rules from a base config and override or add checks per environment:

yaml
# base.yaml — shared rules
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

yaml
# production.yaml — inherits base, adds stricter rules
extends: base.yaml

checks:
  - name: amount_check
    column: amount
    rules:
      min: 0
      max: 50000
    severity: error
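
The inheritance above boils down to a recursive merge. The sketch below pictures one plausible set of semantics (child scalars win, nested dicts merge, checks merge by name); DataCheck's exact merge rules may differ:

```python
# Illustrative extends-style merge; the merge-by-name handling of the
# checks list is an assumption, not DataCheck's documented algorithm.
def merge_configs(base: dict, child: dict) -> dict:
    merged = dict(base)
    for key, value in child.items():
        if key == "checks":
            # Child checks replace base checks with the same name; new names append.
            by_name = {c["name"]: c for c in base.get("checks", [])}
            by_name.update({c["name"]: c for c in value})
            merged["checks"] = list(by_name.values())
        elif isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = merge_configs(base[key], value)
        else:
            merged[key] = value
    merged.pop("extends", None)  # the directive itself is not part of the result
    return merged

base = {"data_source": {"type": "csv", "path": "./data/orders.csv"},
        "checks": [{"name": "id_check", "column": "id", "rules": {"not_null": True}}]}
prod = {"extends": "base.yaml",
        "checks": [{"name": "amount_check", "column": "amount", "rules": {"min": 0}}]}
merged = merge_configs(base, prod)
```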

Config validation

Check config for errors before running:

bash
datacheck config validate
datacheck config validate datacheck.yaml --strict   # Fail on warnings too

Auto-generate config from data

Analyze a data file and generate validation rules automatically:

bash
datacheck config generate data.csv
datacheck config generate data.csv --confidence high
datacheck config generate data.csv -o custom.yaml

Options:

| Flag | Description |
| --- | --- |
| --confidence / -c | Minimum confidence threshold: low, medium (default), high |
| --output / -o | Output config file path (default: datacheck.yaml) |
| --name / -n | Dataset name (default: derived from filename) |
| --force / -f | Overwrite existing config file |

The generated config includes:

  • Type inference: Correctly distinguishes int, numeric, bool, date, and string types
  • Regex patterns: Auto-detected patterns for IDs, URLs, dates, etc. using [0-9] character classes (not \d) for cross-language compatibility
  • Statistical rules: mean_between, std_dev_less_than, percentile_range with thresholds derived from data
  • Semantic rules: email_valid, phone_valid, url_valid, json_valid based on column name detection
  • Cross-column rules: sum_equals auto-detected when two numeric columns sum to a third
  • Temporal rules: timestamp_range with 1-day margin, no_future_timestamps, date_format with detected format string
  • Reporting block: Includes output_path and export_failures settings
  • Data source block: Includes file type, path, and options (delimiter, encoding, etc.)

Config validation error reporting

datacheck config validate reports all errors at once instead of stopping at the first one. This includes schema errors, missing fields (name, column, rules), and invalid rule definitions:

bash
datacheck config validate checks.yaml
# Configuration has errors:
#   - Check #2: Missing required field 'column'
#   - Check #5: Missing required field 'rules'
#   - Schema validation failed at 'checks.3.rules.min': -1 is not valid

Show resolved config

Display the fully resolved configuration with env vars and inheritance applied:

bash
datacheck config show
datacheck config show datacheck.yaml --format json
datacheck config show --no-resolve-env
datacheck config show --no-resolve-extends

Merge configs

Merge multiple configuration files. Later files override values from earlier files:

bash
datacheck config merge base.yaml production.yaml
datacheck config merge base.yaml prod.yaml -o merged.yaml

List templates

Show all available templates with descriptions:

bash
datacheck config templates

Data Sources

File sources (inline in config)

CSV

yaml
data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

Parquet

yaml
data_source:
  type: parquet
  path: ./data/orders.parquet

Avro (requires pip install datacheck-cli[avro])

yaml
data_source:
  type: avro
  path: ./data/orders.avro

Delta Lake (requires pip install datacheck-cli[deltalake])

yaml
data_source:
  type: delta
  path: ./data/delta-table

Delta Lake supports time travel:

bash
datacheck validate --delta-version 5
datacheck validate --delta-timestamp "2026-01-15T10:00:00"
datacheck validate --storage-options '{"AWS_ACCESS_KEY_ID": "..."}'

SQLite

yaml
data_source:
  type: sqlite
  path: ./data/analytics.db

DuckDB (requires pip install datacheck-cli[duckdb])

yaml
data_source:
  type: duckdb
  path: ./data/analytics.duckdb

Database sources (named sources)

For databases, define named sources in a sources.yaml file:

yaml
# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}
    schema: public

  mysql_db:
    type: mysql
    host: ${MYSQL_HOST}
    port: ${MYSQL_PORT:-3306}
    database: ${MYSQL_DB}
    user: ${MYSQL_USER}
    password: ${MYSQL_PASSWORD}

  mssql_db:
    type: mssql
    host: ${MSSQL_HOST}
    port: ${MSSQL_PORT:-1433}
    database: ${MSSQL_DB}
    user: ${MSSQL_USER}
    password: ${MSSQL_PASSWORD}

Cloud warehouse sources

yaml
# sources.yaml
sources:
  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}
    role: ${SF_ROLE}
    # SSO: authenticator: externalbrowser
    # Key pair: private_key_path: /path/to/key.p8

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public
    # IAM auth: cluster_identifier, region, iam_auth: true

Snowflake, BigQuery, and Redshift support server-side filtering and sampling — WHERE clauses, LIMIT, and TABLESAMPLE execute on the warehouse to minimize data transfer before validation runs locally.

Cloud storage sources

yaml
# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

  gcs_data:
    type: gcs
    bucket: my-bucket
    path: data/orders.csv
    credentials_path: /path/to/service-account.json

  azure_data:
    type: azure
    container: my-container
    path: data/orders.csv
    connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    # Or: account_name + account_key

Connection strings

You can also pass connection strings directly to the CLI:

bash
datacheck validate postgresql://user:pass@host:5432/db --table orders
datacheck validate mysql://user:pass@host:3306/db --table orders
datacheck validate mssql://user:pass@host:1433/database --table orders
datacheck validate snowflake://account/database/schema --table orders
datacheck validate bigquery://project/dataset --table orders
datacheck validate redshift://user:pass@host:5439/database/schema --table orders

Named sources and per-check overrides

Reference a named source in your config:

yaml
# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders
    rules:
      min: 0

Switch sources at runtime:

bash
datacheck validate --source snowflake_wh --config checks.yaml
datacheck validate --source s3_data --sources-file sources.yaml

Connection pre-validation

When validating against database sources, DataCheck tests connectivity for all referenced sources before running any validation rules. If multiple sources are unreachable, all connection errors are reported together:

Source connectivity check failed:
  - Source 'production_db' (postgresql): Connection failed — could not connect to server
  - Source 'analytics_wh' (snowflake): Connection failed — invalid credentials

For file-based sources, DataCheck verifies the file exists before validation begins.

SQL filtering

Use --table, --where, and --query for server-side filtering:

bash
datacheck validate --source production_db --table orders --where "status = 'active'"
datacheck validate --source production_db --query "SELECT * FROM orders WHERE created_at > '2026-01-01'"

Validation Rules

Null and uniqueness

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| not_null | not_null: true | No null or missing values |
| unique | unique: true | No duplicate values (nulls ignored) |
| unique_combination | unique_combination: [col1, col2] | Composite uniqueness across columns |

Numeric

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| min | min: 0 | Column >= value |
| max | max: 10000 | Column <= value |
| mean_between | mean_between: {min: 10, max: 50} | Column mean within range |
| std_dev_less_than | std_dev_less_than: 5.0 | Standard deviation below threshold |
| percentile_range | percentile_range: {p25_min: 10, p25_max: 20, p75_min: 80, p75_max: 90} | 25th and 75th percentile bounds |
| z_score_outliers | z_score_outliers: 3.0 | Detect outliers by z-score (default threshold: 3.0) |
| distribution_type | distribution_type: 'normal' | Validate distribution shape — normal or uniform (uses KS test) |

String and pattern

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| regex | regex: '^[A-Z]{2}[0-9]{4}$' | Match regex pattern |
| allowed_values | allowed_values: [active, inactive, pending] | Value in allowed set |
| type | type: 'string' | Data type check (int, numeric, string, bool, date, datetime) |
| length | length: {min: 1, max: 100} | String length constraints |
| min_length | min_length: 1 | Minimum string length |
| max_length | max_length: 255 | Maximum string length |

Temporal

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| max_age | max_age: '24h' | Data freshness — supports h (hours), d (days), w (weeks), m (minutes) |
| timestamp_range | timestamp_range: {min: "2025-01-01", max: "2026-12-31"} | Timestamps within range (ISO format) |
| date_range | date_range: {min: "2025-01-01", max: "2026-12-31"} | Alias for timestamp_range |
| no_future_timestamps | no_future_timestamps: true | No timestamps beyond current time |
| date_format_valid | date_format_valid: '%Y-%m-%d' | Validates date format (Python strftime) |
| date_format | date_format: {format: '%Y-%m-%d'} | Alias for date_format_valid (dict form) |
| business_days_only | business_days_only: 'US' | Weekdays only — pass country code (e.g., 'US', 'GB') or true for default |

Semantic and format

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| email_valid | email_valid: true | RFC 5322 email format (two-stage: regex pre-filter + email-validator library) |
| phone_valid | phone_valid: 'US' | Phone number format (phonenumbers library, supports all countries; pass country code or true) |
| url_valid | url_valid: true | URL structure validation |
| json_valid | json_valid: true | Valid JSON parsing |
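
The two-stage pattern behind email_valid can be sketched like this. The real second stage delegates to the email-validator library; the simplified structural checks below are a stand-in so the sketch stays self-contained:

```python
import re

# Stage 1: cheap regex pre-filter that rejects obvious non-emails.
_PREFILTER = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_valid(value: str) -> bool:
    """Two-stage email check (simplified stand-in for the library-backed stage)."""
    if not _PREFILTER.match(value):          # stage 1: fast rejection
        return False
    local, _, domain = value.rpartition("@")
    if len(local) > 64 or len(value) > 254:  # stage 2: RFC length limits
        return False
    # Every domain label must be non-empty and not start with a hyphen.
    return all(label and not label.startswith("-") for label in domain.split("."))

print([email_valid(v) for v in ["a@example.com", "not-an-email", "a@b"]])
# [True, False, False]
```

The pre-filter keeps the common case fast; only strings that survive it pay for the stricter validation.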

Cross-column and relationships

| Rule | YAML Syntax | Description |
| --- | --- | --- |
| unique_combination | unique_combination: [col1, col2] | Composite uniqueness across multiple columns |
| foreign_key_exists | Python API | Foreign key validation against a reference DataFrame (use Python API to pass live data) |
| sum_equals | sum_equals: {column_a: col1, column_b: col2} | Verify column equals sum of two other columns (with optional tolerance) |
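
To make the sum_equals semantics concrete, here is a hypothetical row-level re-implementation (DataCheck itself applies the rule to DataFrame columns; the tolerance parameter mirrors the optional tolerance mentioned above):

```python
import math

def sum_equals(rows, target, column_a, column_b, tolerance=1e-9):
    """One bool per row: True where row[target] == row[column_a] + row[column_b]."""
    return [math.isclose(r[target], r[column_a] + r[column_b], abs_tol=tolerance)
            for r in rows]

rows = [
    {"subtotal": 90.0, "tax": 10.0, "total": 100.0},
    {"subtotal": 50.0, "tax": 5.0, "total": 60.0},   # off by 5 -> invalid
]
print(sum_equals(rows, "total", "subtotal", "tax"))
# [True, False]
```

A tolerance is essential here: floating-point totals rarely compare exactly equal, so an absolute (or relative) epsilon avoids false failures.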

Example: complete config with rules

yaml
data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_not_null
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_range
    column: amount
    rules:
      not_null: true
      min: 0
      max: 100000
      z_score_outliers:
        threshold: 3.0
    severity: error

  - name: email_format
    column: email
    rules:
      email_valid: true
    severity: warning

  - name: order_date
    column: created_at
    rules:
      no_future_timestamps: true
      max_age: '30d'
      date_format_valid: '%Y-%m-%d %H:%M:%S'

  - name: status_values
    column: status
    rules:
      allowed_values:
        - pending
        - confirmed
        - shipped
        - delivered
        - cancelled

Custom Rules

Creating custom rules

Create a Python file with functions decorated with @custom_rule. Each function receives a pd.Series and optional parameters, and returns a boolean pd.Series where True means valid:

python
# custom_rules.py
from datacheck.plugins.decorators import custom_rule
import pandas as pd

@custom_rule
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    """Validate that emails use approved business domains."""
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)

@custom_rule
def is_positive_margin(column: pd.Series, min_margin: float = 0.0) -> pd.Series:
    """Validate profit margin is above threshold."""
    return column.dropna() >= min_margin

Referencing plugins in config

yaml
plugins:
  - ./custom_rules.py

checks:
  - name: email_domain
    column: email
    rules:
      custom:
        rule: is_business_email
        params:
          allowed_domains: ["company.com", "corp.com"]

  - name: margin_check
    column: profit_margin
    rules:
      custom:
        rule: is_positive_margin
        params:
          min_margin: 0.05

Plugin registry

  • load_from_file() imports the Python module and registers all @custom_rule decorated functions
  • Registered rules become available through the RuleFactory alongside built-in rules
  • The global registry tracks all loaded custom rules

Data Profiling

Running profiling

bash
# Direct file path
datacheck profile data.csv

# Auto-discover config
datacheck profile

# Explicit config file
datacheck profile --config checks.yaml

# Named source
datacheck profile --source production_db --sources-file sources.yaml

# Named source with table
datacheck profile --source production_db --table orders

Profile options

| Flag | Description |
| --- | --- |
| --format / -f | Output format: terminal (default), json, markdown |
| --output / -o | Write output to file |
| --outlier-method | Outlier detection method: zscore (default) or iqr |
| --suggestions / --no-suggestions | Show rule suggestions (default: enabled) |
| --correlations / --no-correlations | Show correlation matrix |

bash
datacheck profile data.csv --format json -o profile.json
datacheck profile --outlier-method iqr --correlations
datacheck profile --format markdown -o report.md

What profiling computes

  • Basic counts: total rows, null count, unique count, duplicate count, completeness percentage
  • Numeric statistics: min, max, mean, median, standard deviation, 25th/50th/75th percentiles
  • Value distributions: top N values with counts
  • Outlier detection: Z-score method (|z| > 3.0) or IQR method (values outside Q1-1.5*IQR to Q3+1.5*IQR)
  • Correlation matrix: Pearson correlation between all numeric columns
  • Quality scoring: 0-100 score per column and per dataset
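
The two outlier methods listed above can be sketched with the standard library; the thresholds mirror the documented defaults (|z| > 3.0, fences at 1.5 × IQR):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose z-score magnitude exceeds the threshold."""
    mean, stdev = statistics.mean(values), statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

data = [10, 11, 9, 10, 12, 11, 10, 500]   # 500 is an obvious outlier
print(iqr_outliers(data))
# [500]
```

Note that on small samples an extreme value inflates the standard deviation enough to hide itself from the z-score test (masking), which is why the IQR method is offered as an alternative.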

Quality scoring

Each column receives a 0-100 quality score based on:

| Factor | What it measures |
| --- | --- |
| Completeness | Penalizes null/missing values |
| Uniqueness | Penalizes duplicate values |
| Validity | Type consistency across the column |
| Consistency | Low variance in categorical columns |

The dataset score is a weighted average of all column scores.
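
A toy version of the per-column score, using only the completeness and uniqueness factors, looks like this. The 50/50 weighting is an illustrative assumption, not DataCheck's actual formula:

```python
# Illustrative weights; DataCheck's real scoring also folds in
# validity and consistency factors.
WEIGHTS = {"completeness": 0.5, "uniqueness": 0.5}

def column_quality(values):
    """0-100 quality score from completeness and uniqueness alone."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / total if total else 0.0
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    score = 100 * (WEIGHTS["completeness"] * completeness
                   + WEIGHTS["uniqueness"] * uniqueness)
    return round(score, 1)

print(column_quality([1, 2, 3, 4, None]))   # 80% complete, 100% unique
# 90.0
```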

Rule suggestions

The profiler automatically suggests validation rules based on data patterns:

  • Numeric columns: range rules, outlier thresholds, distribution checks, type (int vs numeric)
  • String columns: length constraints, regex patterns, allowed value sets
  • Temporal columns: date format detection, timestamp ranges (with margin), no_future_timestamps
  • Semantic columns: email_valid, phone_valid, url_valid, json_valid inferred from column names and content
  • Cross-column: sum_equals auto-detected when two numeric columns sum to a third
  • All columns: null checks, uniqueness rules

Schema Detection and Evolution

Commands

bash
datacheck schema capture              # Save current schema as baseline
datacheck schema compare              # Compare current data against baseline
datacheck schema show                 # Display detected schema
datacheck schema list                 # List all saved baselines
datacheck schema history              # View capture history

Schema capture

bash
datacheck schema capture data.csv
datacheck schema capture --source production_db --sources-file sources.yaml
datacheck schema capture --name v2-baseline
datacheck schema capture --baseline-dir ./schemas
datacheck schema capture --no-history

| Flag | Description |
| --- | --- |
| --name / -n | Baseline name (default: baseline) |
| --baseline-dir | Storage directory (default: .datacheck/schemas/) |
| --save-history / --no-history | Save to history (default: enabled) |

Schema compare

bash
datacheck schema compare data.csv
datacheck schema compare --baseline v2-baseline
datacheck schema compare --fail-on-breaking
datacheck schema compare --rename-threshold 0.9
datacheck schema compare --format json

| Flag | Description |
| --- | --- |
| --baseline / -b | Baseline name to compare against (default: baseline) |
| --rename-threshold | Similarity threshold for rename detection (0.0-1.0, default: 0.8) |
| --fail-on-breaking | Exit with code 1 on breaking changes |
| --format / -f | Output format: terminal (default) or json |
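
Threshold-based rename detection can be pictured as pairing each removed column with the most similar added column. The matching heuristic below (difflib string similarity, default threshold 0.8 as documented) is an assumption, not DataCheck's exact logic:

```python
from difflib import SequenceMatcher

def detect_renames(removed, added, threshold=0.8):
    """Pair removed/added column names whose similarity clears the threshold."""
    renames = []
    for old in removed:
        best = max(added,
                   key=lambda new: SequenceMatcher(None, old, new).ratio(),
                   default=None)
        if best and SequenceMatcher(None, old, best).ratio() >= threshold:
            renames.append((old, best))
    return renames

print(detect_renames(["customer_id"], ["customer_ID", "order_total"]))
# [('customer_id', 'customer_ID')]
```

Raising --rename-threshold toward 1.0 makes the comparison stricter, so near-matches are reported as a removal plus an addition instead of a rename.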

Schema compare exit codes

| Code | Meaning |
| --- | --- |
| 0 | Compatible — no breaking changes |
| 1 | Breaking changes detected (with --fail-on-breaking) |
| 2 | Baseline not found |
| 3 | Data load error |
| 4 | Unexpected error |

What schema tracks

For each column: name, data type, nullable status, position, unique value count, null percentage. For the dataset: row count, source identifier, capture timestamp.

Change types detected

| Change | Compatibility Level |
| --- | --- |
| Column added | COMPATIBLE |
| Column removed | BREAKING |
| Column renamed | WARNING |
| Nullable changed | WARNING |
| Order changed | COMPATIBLE |

Type change compatibility

Compatible changes (widening): int→float, int→string, float→string, bool→string, date→datetime, date→string, datetime→string

Breaking changes (narrowing): float→int, string→int, string→float, string→bool, datetime→date, string→datetime, string→date

Baseline storage

  • Baselines are stored as JSON files in .datacheck/schemas/
  • History entries are stored in .datacheck/schemas/history/ with timestamps (e.g. schema_20260212_143000.json)
  • Use datacheck schema list to see all baselines
  • Use datacheck schema history --limit 20 to see recent history

Sampling Strategies

Available strategies

| Strategy | Description | Key Parameters |
| --- | --- | --- |
| random | Simple random sampling | sample_rate or sample_count, seed |
| stratified | Preserve value distributions across groups | stratify_column, min_per_stratum |
| time_based | Sample within a time window | time_column, start_date, end_date |
| error_focused | Prioritize rows matching error conditions | error_conditions (e.g. ['age<0', 'price>10000']) |
| adaptive | Adjust sample size based on data characteristics | target_quality, initial_size |
| reservoir | Single-pass sampling for streaming data | sample_count |
| systematic | Every Nth row | sample_rate |
| top_n | First N rows | --top N |
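
The reservoir strategy is the classic Algorithm R: it maintains a uniform fixed-size sample in one pass, without knowing the stream length up front. A minimal sketch (not DataCheck's implementation):

```python
import random

def reservoir_sample(stream, sample_count, seed=None):
    """Uniform sample of sample_count items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(stream):
        if i < sample_count:
            reservoir.append(row)           # fill the reservoir first
        else:
            j = rng.randint(0, i)           # replace with decreasing probability
            if j < sample_count:
                reservoir[j] = row
    return reservoir

sample = reservoir_sample(range(100_000), sample_count=1000, seed=42)
print(len(sample))
# 1000
```

Passing a seed makes the sample reproducible across runs, which is what the --seed flag provides for the CLI strategies.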

CLI sampling flags

bash
# Random sampling
datacheck validate --sample-rate 0.1              # 10% of rows
datacheck validate --sample-count 1000            # Exactly 1000 rows
datacheck validate --sample-count 1000 --seed 42  # Reproducible

# First N rows
datacheck validate --top 500

# Strategy-based
datacheck validate --sample-strategy stratified --stratify region
datacheck validate --sample-strategy time_based --time-column created_at --start-date 2026-01-01 --end-date 2026-02-01
datacheck validate --sample-strategy error_focused --error-indicators "age<0,price>10000"

| Flag | Description |
| --- | --- |
| --sample-rate | Fraction to sample (0.0-1.0) |
| --sample-count | Exact number of rows to sample |
| --top | First N rows only |
| --sample-strategy | Strategy name: random, stratified, time_based, error_focused, adaptive, reservoir |
| --stratify | Column for stratified sampling |
| --seed | Random seed for reproducibility |
| --time-column | Column for time-based sampling |
| --start-date | Start date (ISO format) |
| --end-date | End date (ISO format) |
| --error-indicators | Comma-separated error conditions |

CLI Command Reference

datacheck validate

Run validation against data files or databases.

Data source flags:

| Flag | Description |
| --- | --- |
| data_source (positional) | File path or connection string |
| --config / -c | Path to validation config YAML |
| --source | Named source from sources.yaml |
| --sources-file | Path to sources YAML file |
| --table / -t | Database table name |
| --where / -w | SQL WHERE clause for filtering |
| --query / -q | Custom SQL query |
| --schema / -s | Schema/dataset name |

Warehouse-specific flags:

| Flag | Description |
| --- | --- |
| --warehouse | Snowflake warehouse name |
| --credentials | Path to credentials file (BigQuery service account) |
| --region | AWS region (Redshift IAM auth) |
| --cluster | Cluster identifier (Redshift IAM auth) |
| --iam-auth | Use IAM authentication (Redshift) |

Delta Lake flags:

| Flag | Description |
| --- | --- |
| --delta-version | Delta Lake version to load (time travel) |
| --delta-timestamp | Timestamp to load data as of (ISO 8601) |
| --storage-options | JSON string of storage options for cloud access |

Sampling flags: See Sampling Strategies.

Execution flags:

| Flag | Description |
| --- | --- |
| --parallel | Enable multi-core parallel execution |
| --workers | Number of worker processes (default: CPU count) |
| --chunk-size | Rows per chunk for parallel processing (default: 10,000) |
| --progress / --no-progress | Show/hide progress bar |

Output flags:

| Flag | Description |
| --- | --- |
| --output / -o | Save results to a JSON file |
| --csv-export | Export failure details as CSV |
| --suggestions / --no-suggestions | Show improvement suggestions (default: enabled) |
| --slack-webhook | Slack webhook URL for notifications |

Logging flags:

| Flag | Description |
| --- | --- |
| --log-level | Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-format | Log format: console (human-readable) or json (machine-parseable) |
| --log-file | Path to log file (with automatic rotation) |
| --verbose / -v | Shortcut for --log-level DEBUG |

datacheck profile

Generate data quality profiles with statistics, quality scores, and rule suggestions.

Same data source flags as validate, plus:

| Flag | Description |
| --- | --- |
| --format / -f | Output format: terminal (default), json, markdown |
| --output / -o | Write output to file |
| --outlier-method | Detection method: zscore (default) or iqr |
| --suggestions / --no-suggestions | Show rule suggestions |
| --correlations / --no-correlations | Show correlation matrix |

datacheck config

Configuration management commands.

| Subcommand | Description |
| --- | --- |
| config init | Generate config from template |
| config init --template <name> | Use specific template (basic, ecommerce, healthcare, finance, saas, iot, rules-reference, sources) |
| config init --with-sample-data | Also generate a sample CSV file |
| config init --sample-rows N | Number of sample rows to generate (default: 100) |
| config init --force | Overwrite existing config file |
| config validate <file> | Validate config file syntax and rule definitions |
| config validate --strict | Fail on warnings too |
| config show <file> | Show fully resolved config (env vars + inheritance applied) |
| config show --format yaml/json | Output format |
| config show --no-resolve-env | Skip environment variable resolution |
| config show --no-resolve-extends | Skip config inheritance resolution |
| config merge <files...> | Merge multiple configs (later files override earlier) |
| config merge -o output.yaml | Write merged result to file |
| config generate <file> | Auto-generate rules from data analysis |
| config generate --confidence | Minimum confidence: low, medium (default), high |
| config templates | List available templates with descriptions |
| config env <file> | Show environment variables referenced in config |

datacheck schema

Schema evolution detection and management.

| Subcommand | Description |
| --- | --- |
| schema capture | Save current schema as baseline |
| schema compare | Compare current data against baseline |
| schema show | Display detected schema (columns, types, nullable, stats) |
| schema list | List all saved baseline schemas |
| schema history | View capture history (newest first) |

datacheck version

Display version information.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All rules passed (or only warning/info severity failures) |
| 1 | Some error-severity rules failed |
| 2 | Configuration error |
| 3 | Data loading error |
| 4 | Unexpected error |

Output and Reporting

Terminal output

DataCheck uses Rich-formatted terminal output with color-coded results:

  • Green: Passed rules
  • Red: Failed rules
  • Yellow: Errors during rule execution

Output includes a statistics table (records, columns, rules, pass/fail counts), detailed failure tables (check name, column, failure count, sample values), and actionable improvement suggestions.

JSON export

bash
datacheck validate --output results.json

Exports full validation results in machine-readable JSON format, including all rule results, failure details, and summary statistics. Use this for automation and CI/CD integration.

CSV export

bash
datacheck validate --csv-export failures.csv

Exports failure details as CSV with columns: check_name, column, severity, failed_rows, reason, suggestion.

Markdown reports

bash
datacheck profile --format markdown -o report.md

Generates markdown-formatted profile reports with tables, statistics, and quality scores.

Slack notifications

Configure the webhook in your config file so you don't need to pass it every time:

yaml
notifications:
  slack_webhook: "${SLACK_WEBHOOK}"
  mention_on_failure: true    # @channel on failures (default: false)

Or pass it via the CLI (overrides the config value):

bash
datacheck validate --slack-webhook https://hooks.slack.com/services/...

Sends validation results to Slack with:

  • Color-coded messages (green for pass, red for fail)
  • Summary statistics and failed rules
  • Optional @channel mention on failures (via mention_on_failure)
  • Up to 5 failed rule details with row counts

Parallel Execution and Performance

Enabling parallel mode

bash
datacheck validate --parallel
datacheck validate --parallel --workers 4
datacheck validate --parallel --chunk-size 50000
datacheck validate --parallel --progress

| Flag | Description |
| --- | --- |
| --parallel | Enable multi-core parallel execution |
| --workers | Number of worker processes (default: CPU count) |
| --chunk-size | Rows per chunk (default: 10,000) |
| --progress / --no-progress | Show/hide progress bar |

How parallel execution works

  1. Splits the DataFrame into chunks based on --chunk-size
  2. Processes chunks in parallel using multiprocessing.Pool
  3. Aggregates results across chunks (combines pass/fail counts, merges failure details)
  4. Automatically falls back to sequential execution for small datasets
  5. Shows a Rich progress bar with spinner, elapsed time, and remaining time
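
Steps 1-3 above can be sketched as follows. DataCheck runs the per-chunk work through multiprocessing.Pool; a plain map stands in here so the sketch stays self-contained, since the splitting and aggregation logic is the interesting part (validate_chunk is a hypothetical stand-in for a real rule):

```python
def split_chunks(rows, chunk_size):
    """Step 1: split the data into fixed-size chunks."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def validate_chunk(chunk, min_value=0):
    """Step 2: validate one chunk (toy min-value rule) and report counts."""
    failures = [v for v in chunk if v < min_value]
    return {"passed": len(chunk) - len(failures),
            "failed": len(failures),
            "samples": failures[:5]}

def aggregate(results):
    """Step 3: combine per-chunk counts and merge failure samples."""
    results = list(results)   # accept any iterable, including map()
    return {"passed": sum(r["passed"] for r in results),
            "failed": sum(r["failed"] for r in results),
            "samples": [s for r in results for s in r["samples"]][:5]}

rows = [5, -1, 3, -2, 8, 9, -4, 1]
report = aggregate(map(validate_chunk, split_chunks(rows, chunk_size=3)))
print(report)
# {'passed': 5, 'failed': 3, 'samples': [-1, -2, -4]}
```

Because each chunk is validated independently, swapping the map for a process pool parallelizes the work without changing the aggregation.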

Performance features

  • PyArrow backend: Vectorized operations for faster validation (e.g. fast null count via Arrow)
  • Lazy loading: Cloud connectors are loaded only when needed — no unnecessary dependencies
  • Memory optimization: Memory-aware chunk sizing, worker auto-scaling, and large file handling
  • Caching: Regex compilation caching (@lru_cache) and compute-once patterns for expensive operations
  • Vectorized rules: NumPy/Pandas vectorized operations — no Python loops in hot paths

Logging

Log configuration

bash
datacheck validate --verbose                        # DEBUG level
datacheck validate --log-level WARNING               # Specific level
datacheck validate --log-format json                 # Machine-parseable JSON logs
datacheck validate --log-file validation.log         # Log to file (with rotation)
datacheck validate --log-level DEBUG --log-format json --log-file debug.log

| Flag | Description |
| --- | --- |
| --log-level | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| --log-format | console (human-readable, default) or json (machine-parseable) |
| --log-file | Path to log file (automatic rotation) |
| --verbose / -v | Shortcut for --log-level DEBUG |

Logging features

  • Structured logging: Console and JSON formatters for different use cases
  • Sensitive data masking: Automatically masks credentials and passwords in log output
  • Trace IDs: Unique trace ID per validation run for log correlation across systems
  • File rotation: Automatic log file rotation to prevent unbounded growth
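
Sensitive-data masking typically works as a substitution pass over each log line before it is emitted. The sketch below illustrates the idea; the list of sensitive keys is an assumption, not DataCheck's exact filter:

```python
import re

# Match key=value or key: value pairs for common credential keys.
_SENSITIVE = re.compile(
    r"(password|secret|token|api_key)(\s*[=:]\s*)(\S+)", re.IGNORECASE
)

def mask_secrets(message: str) -> str:
    """Replace credential values with *** while keeping the key visible."""
    return _SENSITIVE.sub(lambda m: f"{m.group(1)}{m.group(2)}***", message)

print(mask_secrets("connecting with password=hunter2 host=db.example.com"))
# connecting with password=*** host=db.example.com
```

In practice this runs inside a logging filter or formatter, so every record is scrubbed regardless of which code path emitted it.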

Security

Credential handling

  • Environment variables: Use ${VAR} and ${VAR:-default} syntax in config files — never hardcode credentials
  • Credential files: Load credentials from external files
  • Password masking: Credentials are automatically masked in logs and terminal output
  • Config env audit: Use datacheck config env to verify all required variables are set

Connection security

  • Connection string validation before attempting connections
  • SQL injection prevention: table name validation, WHERE clause scanning, parameterized queries
  • Path traversal prevention with null byte and symlink detection
  • SSL/TLS enforcement for warehouse connections
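
Table-name validation typically means an identifier allowlist: a name is interpolated into SQL only if it matches a strict pattern, so injection payloads are rejected outright. A hedged sketch of the idea (hypothetical pattern, not DataCheck's exact rules):

```python
import re

# Conservative allowlist: letters, digits, underscores,
# optionally schema-qualified (schema.table). Anything else is rejected.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")

def validate_table_name(name: str) -> str:
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not _TABLE_NAME.match(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name

assert validate_table_name("public.orders") == "public.orders"
try:
    validate_table_name("orders; DROP TABLE users--")
except ValueError:
    pass  # injection attempt rejected
```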

Airflow Integration

DataCheck provides two Airflow operators for use in DAGs, plus a simpler BashOperator pattern.

DataCheckOperator

Run data validation inside Airflow DAGs:

python
from datacheck.airflow.operators import DataCheckOperator

validate_orders = DataCheckOperator(
    task_id="validate_orders",
    config_path="/path/to/datacheck.yaml",
    file_path="/data/orders.csv",
    fail_on_error=True,
    push_results=True,
    min_pass_rate=95.0,
)

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str | required | Path to validation config YAML |
| file_path | str | None | Path to data file (CSV, Parquet, Avro, Delta, etc.) |
| sources_file | str | None | Path to sources YAML (overrides config) |
| source_name | str | None | Named source from sources.yaml |
| table | str | None | Database table name |
| where | str | None | SQL WHERE clause |
| query | str | None | Custom SQL query |
| sample_rate | float | None | Random sample fraction (0.0-1.0) |
| parallel | bool | False | Enable multi-core validation |
| workers | int | None | Number of worker processes |
| min_pass_rate | float | 0 | Minimum rule pass rate (0-100, 0 = disabled) |
| min_quality_score | float | 0 | Minimum quality score (0-100, 0 = disabled) |
| fail_on_error | bool | True | Fail Airflow task on validation failure |
| push_results | bool | True | Push results to XCom |

Template fields: config_path, file_path, sources_file, source_name, table, where, query (supports .yaml and .yml extensions)

XCom output:

  • validation_results: Full results dictionary
  • passed: Boolean pass/fail result
  • pass_rate: Percentage of rules passed

Data source resolution order:

  1. file_path — file-based validation
  2. source_name + sources_file — named source validation
  3. Config default (source or data_source from config)
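
The resolution order above can be sketched as a small precedence function (hypothetical helper, not the operator's actual code):

```python
from typing import Optional

def resolve_source(
    file_path: Optional[str] = None,
    source_name: Optional[str] = None,
    sources_file: Optional[str] = None,
    config_default: Optional[str] = None,
) -> str:
    """Mirror the documented order: file, then named source, then config default."""
    if file_path:
        return f"file:{file_path}"
    if source_name and sources_file:
        return f"source:{source_name}@{sources_file}"
    if config_default:
        return f"config:{config_default}"
    raise ValueError("no data source could be resolved")

# file_path wins even when a named source is also supplied
assert resolve_source(
    file_path="/data/orders.csv",
    source_name="prod_db",
    sources_file="sources.yaml",
) == "file:/data/orders.csv"
```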

DataCheckSchemaOperator

Detect schema changes inside Airflow DAGs:

python
from datacheck.airflow.operators import DataCheckSchemaOperator

check_schema = DataCheckSchemaOperator(
    task_id="check_schema",
    config_path="/path/to/datacheck.yaml",
    file_path="/data/orders.csv",
    baseline_name="orders-v2",
    fail_on_breaking=True,
    push_results=True,
)

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| config_path | str | required | Path to validation config YAML |
| file_path | str | None | Path to data file |
| sources_file | str | None | Path to sources YAML |
| source_name | str | None | Named source from sources.yaml |
| table | str | None | Database table name |
| baseline_name | str | "baseline" | Baseline identifier |
| baseline_dir | str | ".datacheck/schemas" | Baseline storage directory |
| fail_on_breaking | bool | True | Fail Airflow task on breaking schema changes |
| push_results | bool | True | Push results to XCom |

XCom output:

  • schema_results: Schema comparison results dictionary
  • schema_compatible: Boolean compatibility flag

Auto-captures a new baseline if none exists yet.

BashOperator pattern

For simpler integration, use Airflow's BashOperator directly:

python
from airflow.operators.bash import BashOperator

validate = BashOperator(
    task_id="validate_data",
    bash_command="datacheck validate --config /path/to/config.yaml --output /tmp/results.json",
)

Exit codes work directly with Airflow task status — exit code 0 means success, any non-zero code fails the task.


CI/CD Integration

DataCheck uses standard exit codes for automation. Any non-zero exit code fails the pipeline.

| Code | Meaning | CI/CD Effect |
|---|---|---|
| 0 | All rules passed | Pipeline continues |
| 1 | Error-severity failures | Pipeline fails (blocks deploy) |
| 2 | Configuration error | Pipeline fails |
| 3 | Data loading error | Pipeline fails |
| 4 | Unexpected error | Pipeline fails |
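
Scripts that wrap the CLI can branch on these codes instead of just pass/fail. A hedged sketch of that pattern (the mapping mirrors the table above; the wrapper function is hypothetical):

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "All rules passed",
    1: "Error-severity failures",
    2: "Configuration error",
    3: "Data loading error",
    4: "Unexpected error",
}

def run_and_explain(cmd: list[str]) -> tuple[int, str]:
    """Run a command and translate its exit code into DataCheck's meaning."""
    result = subprocess.run(cmd)
    return result.returncode, EXIT_MEANINGS.get(result.returncode, "Unknown exit code")

# In CI you would call something like:
#   run_and_explain(["datacheck", "validate", "--output", "results.json"])
# Demonstrated here with a stub process that exits with code 2:
code, meaning = run_and_explain([sys.executable, "-c", "import sys; sys.exit(2)"])
assert (code, meaning) == (2, "Configuration error")
```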

GitHub Actions

yaml
name: Data Quality Check
on: [push]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install DataCheck
        run: pip install datacheck-cli
      - name: Validate Data
        run: datacheck validate --output results.json
      - name: Upload Results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation-results
          path: results.json

GitLab CI

yaml
validate_data:
  image: python:3.12
  script:
    - pip install datacheck-cli
    - datacheck validate --output results.json
  artifacts:
    paths:
      - results.json
    when: always

Jenkins

groovy
pipeline {
    agent any
    stages {
        stage('Data Validation') {
            steps {
                sh 'pip install datacheck-cli'
                sh 'datacheck validate --output results.json'
            }
            post {
                always {
                    archiveArtifacts artifacts: 'results.json', allowEmptyArchive: true
                }
            }
        }
    }
}

Python API

ValidationEngine

python
from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

Constructor parameters:

| Parameter | Description |
|---|---|
| config / config_path | Configuration object or path to YAML file |
| parallel | Enable parallel execution (bool) |
| workers | Number of worker processes (int) |
| chunk_size | Rows per chunk for parallel execution (int) |
| show_progress | Show progress bar (bool) |
| notifier | Optional notifier instance (e.g. SlackNotifier) |
| sources_file | Path to sources YAML (overrides config) |

Methods:

| Method | Description |
|---|---|
| validate() | Validate using config defaults |
| validate_file(file_path, **kwargs) | Validate a file (supports sampling, delta time travel) |
| validate_sources(source_name, table, where, query, **kwargs) | Validate a named source |
| validate_dataframe(df) | Validate a pre-loaded pandas DataFrame |

ValidationSummary

| Property | Type | Description |
|---|---|---|
| total_rules | int | Total number of rules executed |
| passed_rules | int | Rules that passed |
| failed_rules | int | Rules that failed |
| failed_errors | int | Failed rules with error severity |
| failed_warnings | int | Failed rules with warning severity |
| failed_info | int | Failed rules with info severity |
| error_rules | int | Rules that encountered execution errors |
| all_passed | bool | Whether all rules passed |
| has_errors | bool | Whether any execution errors occurred |
| results | list | List of RuleResult objects |
| total_rows | int | Number of data rows |
| total_columns | int | Number of columns |
| timestamp | str | Execution timestamp |
| duration | float | Execution duration in milliseconds |
| trace_id | str | Unique run identifier for log correlation |

| Method | Returns | Description |
|---|---|---|
| get_passed_results() | list | RuleResults that passed |
| get_failed_results() | list | RuleResults that failed |
| get_error_results() | list | RuleResults with execution errors |
| to_dict() | dict | Serialize to dictionary |
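
These counters are what a quality gate (like the operators' min_pass_rate check) works from. A sketch of such a gate using a stand-in dataclass that mirrors the fields above (not the real ValidationSummary class):

```python
from dataclasses import dataclass

@dataclass
class SummaryStub:
    """Stand-in mirroring the ValidationSummary fields a gate needs."""
    total_rules: int
    passed_rules: int
    failed_errors: int

def passes_gate(summary: SummaryStub, min_pass_rate: float = 95.0) -> bool:
    """Fail on any error-severity failure, or on a pass rate below the threshold."""
    if summary.failed_errors > 0:
        return False
    pass_rate = 100.0 * summary.passed_rules / summary.total_rules
    return pass_rate >= min_pass_rate

# 19/20 rules passed, none at error severity: 95.0% meets the default threshold
assert passes_gate(SummaryStub(total_rules=20, passed_rules=19, failed_errors=0))
# A single error-severity failure fails the gate regardless of pass rate
assert not passes_gate(SummaryStub(total_rules=20, passed_rules=19, failed_errors=1))
```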

RuleResult

| Property | Type | Description |
|---|---|---|
| rule_name | str | Rule identifier |
| column | str | Target column |
| passed | bool | Whether the rule passed |
| total_rows | int | Total rows checked |
| failed_rows | int | Rows that failed |
| rule_type | str | Rule type name |
| check_name | str | Check name from config |
| severity | str | error, warning, or info |
| failure_details | FailureDetail | Detailed failure information |
| error | str | Error message if rule errored |
| execution_time | float | Execution time in milliseconds |

DataProfiler

python
from datacheck.profiling import DataProfiler

profiler = DataProfiler(outlier_method="zscore")
profile = profiler.profile(df, name="orders")  # df: a pre-loaded pandas DataFrame

Industry Templates

DataCheck ships with 8 config templates:

| Template | Use Case |
|---|---|
| basic | Generic starter config for any data |
| ecommerce | Order data, product catalogs, customer records |
| healthcare | Patient data, HIPAA compliance, date formats |
| finance | Transaction data, SOX compliance, sum validations |
| saas | User activity, subscription data, engagement metrics |
| iot | Sensor data, time-series, device telemetry |
| rules-reference | Complete reference of all validation rules with examples |
| sources | Data source connection templates with environment variable support |

bash
datacheck config init --template ecommerce --with-sample-data
datacheck config init --template healthcare --with-sample-data --sample-rows 500
datacheck config templates   # List all templates with descriptions

Error Handling

Exception hierarchy

| Exception | When |
|---|---|
| DataCheckError | Base exception for all DataCheck errors |
| ConfigurationError | Invalid config structure, missing required fields |
| ValidationError | Rule execution failures |
| DataLoadError | File not found, encoding issues, connection failures |
| RuleDefinitionError | Invalid rule parameters or missing required arguments |
| UnsupportedFormatError | Unknown file format or missing optional library |
| ColumnNotFoundError | Column not found in DataFrame |
| EmptyDatasetError | No rows in loaded dataset |

All exceptions inherit from DataCheckError, so you can catch them broadly:

python
from datacheck import ValidationEngine
from datacheck.exceptions import DataCheckError, ConfigurationError, DataLoadError

try:
    engine = ValidationEngine(config_path="config.yaml")
    summary = engine.validate()
except ConfigurationError as e:
    print(f"Config error: {e}")
except DataLoadError as e:
    print(f"Data load error: {e}")
except DataCheckError as e:
    print(f"DataCheck error: {e}")