Part 10 — Data Quality & Testing

Learn the Pattern · Part 10

A pipeline that runs but delivers wrong data is more dangerous than one that fails loudly — because nobody notices until the decision is already made.

In 60 seconds

Wrong data is silent. The dashboard still loads. The number is just… wrong. The fix: test your data like you test your code.

Freshness — is the data recent enough?
Volume — are row counts within the expected range?
Nulls — are required fields actually filled?
Uniqueness — no duplicate keys?
Referential — does every foreign key have a match?
Distribution — do today's values look like history?

The Silent Failure Problem

An application bug is loud: the page errors and someone notices. A data bug is silent. The pipeline succeeds, the dashboard renders, and the number is simply wrong. A join against a table with duplicated keys doubles revenue; an upstream schema change turns a column to nulls; a timezone shift moves yesterday's sales into today. Nobody sees it until a decision has already been made on bad data, and by then trust is gone.
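To make the silent failure concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` and `customers` tables and their values are invented for illustration. A customer row loaded twice upstream makes the join fan out, so the query succeeds and returns an inflated total without any error.

```python
import sqlite3

# Hypothetical tables, in memory, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 50.0);
    -- customer 10 was loaded twice upstream, so the join key is not unique
    INSERT INTO customers VALUES (10, 'Acme'), (10, 'Acme'), (11, 'Globex');
""")

# Revenue straight from the fact table.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Revenue after the join: order 1 matches two customer rows and is counted twice.
joined_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()[0]

# A uniqueness test on the join key is what would have caught it.
duplicate_keys = conn.execute("""
    SELECT customer_id, COUNT(*)
    FROM customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

print(true_revenue, joined_revenue, duplicate_keys)
```

Both queries run cleanly; only the uniqueness check on `customers.customer_id` reveals that something is wrong.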

The cure is to stop treating data as something you only look at, and start treating it as something you test — automatically, inside the pipeline, with failures that stop the build before bad data reaches anyone.

The Core Checks

1. The Six Questions Every Table Should Answer

Freshness (is it recent?), volume (right number of rows?), nulls (required fields present?), uniqueness (no duplicate keys?), referential integrity (foreign keys resolve?), and distribution (do the values look like they normally do?). Between them, these six cover the common ways a table goes wrong, including the duplicated join, the column that turns null, and the shifted timestamps described above. They are the same questions on every platform.
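The six questions can be sketched as six small predicates. This is a plain-Python illustration over a list of dicts, not any framework's API; the function names, field names, thresholds, and sample data are all assumptions made up for the example. The distribution check here is a deliberately simple z-score on the daily mean, one of many possible choices.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def check_freshness(rows, ts_field, max_age, now):
    """Newest row must be no older than max_age."""
    if not rows:
        return False
    return now - max(r[ts_field] for r in rows) <= max_age

def check_volume(rows, lo, hi):
    """Row count must fall inside the expected range."""
    return lo <= len(rows) <= hi

def check_not_null(rows, field):
    """Required field must be present and non-null on every row."""
    return all(r.get(field) is not None for r in rows)

def check_unique(rows, key):
    """No two rows may share the same key."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_referential(rows, fk, parent_keys):
    """Every foreign key must exist in the parent table's keys."""
    return all(r[fk] in parent_keys for r in rows)

def check_distribution(rows, field, history_means, max_z=3.0):
    """Today's mean must sit within max_z standard deviations of history."""
    today = mean(r[field] for r in rows)
    mu, sigma = mean(history_means), stdev(history_means)
    if sigma == 0:
        return today == mu
    return abs(today - mu) / sigma <= max_z

# Hypothetical sample data.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 100.0, "loaded_at": now - timedelta(hours=1)},
    {"order_id": 2, "customer_id": 11, "amount": 120.0, "loaded_at": now - timedelta(hours=2)},
    {"order_id": 3, "customer_id": 10, "amount": 95.0, "loaded_at": now - timedelta(hours=3)},
]
customer_ids = {10, 11}
past_daily_means = [98.0, 105.0, 110.0, 102.0, 107.0]

results = {
    "freshness": check_freshness(orders, "loaded_at", timedelta(hours=24), now),
    "volume": check_volume(orders, 1, 1000),
    "nulls": check_not_null(orders, "amount"),
    "uniqueness": check_unique(orders, "order_id"),
    "referential": check_referential(orders, "customer_id", customer_ids),
    "distribution": check_distribution(orders, "amount", past_daily_means),
}
print(results)
```

In practice each predicate would usually be a SQL query or a declarative test in whichever tool the platform uses; the logic stays the same.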

2. Tests Live in the Pipeline, Not in a Person's Memory

Quality checks should run automatically every time data is built, and a failing test should block the bad data from being published — exactly like a failing unit test blocks a deploy. Manual spot-checks don't scale and quietly stop happening. dbt tests, Great Expectations, and Soda all embed these assertions directly into the run.
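A blocking gate can be sketched in a few lines. This is not how dbt, Great Expectations, or Soda implement it; `DataQualityError`, `run_gate`, `publish`, and `build_and_publish` are hypothetical names used only to show the control flow: checks run first, and a failure raises before the publish step is ever reached.

```python
class DataQualityError(Exception):
    """Raised when a blocking check fails, so the publish step never runs."""

def run_gate(table_name, rows, checks):
    """Run every check; raise if any fails, otherwise pass the rows through."""
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        raise DataQualityError(f"{table_name}: failed checks {failures}")
    return rows

def publish(rows, target):
    """Stand-in for writing to the table that consumers read."""
    target.extend(rows)

def build_and_publish(rows, checks, target):
    validated = run_gate("orders", rows, checks)  # raises on bad data
    publish(validated, target)                    # only reached if the gate passed

# Hypothetical checks and data.
checks = {
    "not_null_amount": lambda rows: all(r["amount"] is not None for r in rows),
    "unique_order_id": lambda rows: len({r["order_id"] for r in rows}) == len(rows),
}
published = []
good = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}]
bad = [{"order_id": 3, "amount": None}]

build_and_publish(good, checks, published)
try:
    build_and_publish(bad, checks, published)
except DataQualityError as exc:
    print("blocked:", exc)

print(len(published))
```

The design point is that the gate sits between build and publish in code, so skipping it requires a deliberate change rather than a forgotten spot-check.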

3. Quality vs. Observability

Data quality means asserting specific rules (this column is never null). Data observability is the broader, monitoring-style view of your whole platform's health: freshness, volume, and schema tracked over time with automatic anomaly alerts, analogous to Prometheus/Grafana for applications. Quality checks catch violations of the rules you thought to write; observability surfaces the problems you didn't anticipate.
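The difference shows up in code. A quality rule is a fixed assertion; an observability monitor learns what normal looks like from history. As a rough sketch of the latter, assuming a simple z-score over recent daily row counts (real observability tools use more robust models, and the function name, threshold, and counts below are made up):

```python
from statistics import mean, stdev

def is_anomalous(history, latest, max_z=3.0):
    """Flag the latest metric value if it sits far outside recent history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > max_z

# Hypothetical daily row counts for one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_070, 10_010]

normal_day = is_anomalous(daily_row_counts, 10_100)   # close to the usual volume
half_loaded = is_anomalous(daily_row_counts, 4_900)   # roughly half the usual volume

print(normal_day, half_loaded)
```

Nobody had to write a rule saying "the table should have about ten thousand rows"; the threshold comes from the tracked history.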

4. Contracts Shift Failures Left

A data contract is a formal agreement on a dataset's schema and semantics between producer and consumer. It turns silent runtime breakages ("why is this column suddenly null?") into loud, early, deploy-time violations — catching the break before it ever reaches a pipeline.
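As an illustration of the schema half of a contract, the sketch below validates a producer's sample rows against an agreed field list, the kind of check that could run in the producer's CI before a change ships. The `CONTRACT` format and `contract_violations` function are hypothetical, not a standard; real contracts are often expressed in a schema language and also cover semantics, which this sketch does not.

```python
# Hypothetical contract: field name -> expected Python type and nullability.
CONTRACT = {
    "order_id": {"type": int, "nullable": False},
    "amount": {"type": float, "nullable": False},
    "coupon_code": {"type": str, "nullable": True},
}

def contract_violations(contract, sample_rows):
    """Return a list of human-readable violations; empty means compliant."""
    violations = []
    for i, row in enumerate(sample_rows):
        # Fields the contract promises but the producer no longer sends.
        for field in sorted(contract.keys() - row.keys()):
            violations.append(f"row {i}: missing field '{field}'")
        for field, rule in contract.items():
            if field not in row:
                continue
            value = row[field]
            if value is None:
                if not rule["nullable"]:
                    violations.append(f"row {i}: '{field}' is null")
            elif not isinstance(value, rule["type"]):
                violations.append(
                    f"row {i}: '{field}' expected {rule['type'].__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations

compliant = [{"order_id": 1, "amount": 9.5, "coupon_code": None}]
broken = [{"order_id": "1", "amount": None}]  # wrong type, null, dropped column

print(contract_violations(CONTRACT, compliant))
for violation in contract_violations(CONTRACT, broken):
    print(violation)
```

Failing the producer's build on a non-empty list is what moves the break from the consumer's dashboard to the producer's deploy.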

Where Quality Gates Sit

Tests gate each stage — bad data is stopped and alerted on, never silently published.

Same Pattern, Every Platform

The Pattern	Snowflake	Databricks	BigQuery	Microsoft Fabric
In-pipeline tests	dbt tests	dbt tests / DLT expectations	dbt tests	dbt / SQL checks in pipelines
Quality frameworks	Great Expectations / Soda	Great Expectations / Soda	Great Expectations / Soda	Great Expectations / Soda
The six checks	Identical questions — only the syntax changes

Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.

The takeaway: the six quality questions — freshness, volume, nulls, uniqueness, referential integrity, distribution — are platform-independent. Wire them into the pipeline as blocking tests on any engine and you've built trust. The tool is the product; the six checks are the pattern.
