Data Quality & Testing
A pipeline that runs but delivers wrong data is more dangerous than one that fails loudly — because nobody notices until the decision is already made.
Wrong data is silent. The dashboard still loads. The number is just… wrong. The fix: test your data like you test your code.
- Freshness — is the data recent enough?
- Volume — are row counts within the expected range?
- Nulls — are required fields actually filled?
- Uniqueness — no duplicate keys?
- Referential — does every foreign key have a match?
- Distribution — do today's values look like history?
The Silent Failure Problem
An application bug is loud: the page errors and someone notices. A data bug is silent. The pipeline succeeds, the dashboard renders, and the number is simply wrong. A join that fans out rows doubles revenue; an upstream schema change turns a column into nulls; a timezone shift moves yesterday's sales into today. Nobody sees it until a decision has already been made on bad data, and by then trust is gone.
The cure is to stop treating data as something you only look at, and start treating it as something you test — automatically, inside the pipeline, with failures that stop the build before bad data reaches anyone.
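The doubled-revenue failure described above can be reproduced in a few lines. This is a minimal sketch with made-up tables (`orders` and `shipments` are hypothetical names): the join runs without error, the total is wrong, and a simple uniqueness test on the join key is what exposes it.

```python
# Hypothetical source table: one row per order.
orders = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 50.0},
]
# Hypothetical lookup table that accidentally holds two rows for order 1.
shipments = [
    {"order_id": 1, "carrier": "A"},
    {"order_id": 1, "carrier": "B"},
    {"order_id": 2, "carrier": "A"},
]

# A plain inner join on order_id: nothing fails, but order 1 appears twice.
joined = [
    {**o, **s}
    for o in orders
    for s in shipments
    if o["order_id"] == s["order_id"]
]

true_revenue = sum(o["amount"] for o in orders)      # 150.0
reported_revenue = sum(r["amount"] for r in joined)  # 250.0

# The test that would have caught it: the join key must stay unique.
joined_ids = [r["order_id"] for r in joined]
key_is_unique = len(joined_ids) == len(set(joined_ids))  # False
```

Nothing in the join itself signals a problem; only the explicit assertion on the key does.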
The Core Checks
1. The Six Questions Every Table Should Answer
Freshness (is it recent?), volume (right number of rows?), nulls (required fields present?), uniqueness (no duplicate keys?), referential integrity (foreign keys resolve?), and distribution (do the values look like they normally do?). These six cover most common data incidents, and they are the same questions on every platform.
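To make the six questions concrete, here is one possible sketch in plain Python over an in-memory list of rows. The row shape (`order_id`, `customer_id`, `amount`, `loaded_at`), the thresholds, and the function name `run_six_checks` are all assumptions chosen for illustration; a real pipeline would express the same checks in SQL or a testing framework.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev


def run_six_checks(orders, customer_ids, daily_mean_history, now,
                   max_age=timedelta(hours=24), row_range=(1, 1000), z_limit=3.0):
    """Return {check_name: passed} for a list of order dicts (hypothetical shape)."""
    ids = [o["order_id"] for o in orders]
    amounts = [o["amount"] for o in orders if o["amount"] is not None]
    hist_mean = mean(daily_mean_history)
    hist_std = pstdev(daily_mean_history)
    today_mean = mean(amounts) if amounts else 0.0
    return {
        # Freshness: the newest row must be younger than max_age.
        "freshness": bool(orders) and now - max(o["loaded_at"] for o in orders) <= max_age,
        # Volume: the row count must fall inside the expected range.
        "volume": row_range[0] <= len(orders) <= row_range[1],
        # Nulls: required fields must be filled on every row.
        "nulls": all(o["order_id"] is not None and o["amount"] is not None for o in orders),
        # Uniqueness: no duplicate primary keys.
        "uniqueness": len(ids) == len(set(ids)),
        # Referential integrity: every customer_id must exist in the parent set.
        "referential": all(o["customer_id"] in customer_ids for o in orders),
        # Distribution: today's mean stays within z_limit standard deviations of history.
        "distribution": abs(today_mean - hist_mean) <= z_limit * hist_std,
    }


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 20.0, "loaded_at": now - timedelta(hours=1)},
    {"order_id": 2, "customer_id": "c2", "amount": 22.0, "loaded_at": now - timedelta(hours=2)},
    # Deliberately bad row: duplicate key, unknown customer, missing amount.
    {"order_id": 2, "customer_id": "c9", "amount": None, "loaded_at": now - timedelta(hours=3)},
]
results = run_six_checks(orders, {"c1", "c2"}, [20.0, 21.0, 19.0, 22.0, 20.0], now)
failed = sorted(name for name, passed in results.items() if not passed)
```

With the sample data, `failed` lists the nulls, referential, and uniqueness checks, while freshness, volume, and distribution pass.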
2. Tests Live in the Pipeline, Not in a Person's Memory
Quality checks should run automatically every time data is built, and a failing test should block the bad data from being published — exactly like a failing unit test blocks a deploy. Manual spot-checks don't scale and quietly stop happening. dbt tests, Great Expectations, and Soda all embed these assertions directly into the run.
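The blocking behavior is the important part, and it can be sketched independently of any tool. The example below is an illustration only: `DataQualityError`, `build_and_publish`, and the two checks are hypothetical names, not the API of dbt, Great Expectations, or Soda.

```python
class DataQualityError(Exception):
    """Raised when a blocking check fails; the run stops before publishing."""


def build_and_publish(rows, checks, publish):
    """Run every check on freshly built rows; publish only if all of them pass."""
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        # Failing loudly here is what keeps bad data away from consumers.
        raise DataQualityError(f"blocked publish, failed checks: {failures}")
    publish(rows)


published = []  # stands in for the downstream table
checks = {
    "not_empty": lambda rows: len(rows) > 0,
    "amount_not_null": lambda rows: all(r["amount"] is not None for r in rows),
}

# A good batch passes the gate and is published.
build_and_publish([{"amount": 10.0}], checks, published.extend)

# A bad batch raises, so publish() is never called for it.
try:
    build_and_publish([{"amount": None}], checks, published.extend)
    blocked = False
except DataQualityError:
    blocked = True
```

The design choice is that the gate sits between build and publish, so a failure leaves the previously published data untouched rather than overwriting it with a bad batch.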
3. Quality vs. Observability
Data quality means asserting specific rules (for example, this column is never null). Data observability is the broader, monitoring-style view of your whole platform's health: freshness, volume, and schema tracked over time with automatic anomaly alerts, analogous to Prometheus/Grafana for applications. Quality catches the rules you thought to write; observability surfaces the problems you didn't.
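The difference shows up in code: an observability check needs no hand-written rule, only history. As a rough sketch, assuming daily row counts are already being recorded, a z-score test flags a day that deviates sharply from the recent past. The function name, the sample counts, and the threshold of 3 are illustrative assumptions; production tools typically use more robust models that account for seasonality.

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_limit=3.0):
    """Flag today's metric if it sits more than z_limit sample std devs from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat history: any change is unusual
    return abs(today - mu) / sigma > z_limit


# Hypothetical row counts for the last seven daily loads.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_day = is_anomalous(daily_row_counts, 1015)  # within the usual spread
broken_day = is_anomalous(daily_row_counts, 400)   # most rows went missing
```

Nobody had to write a rule saying "at least 900 rows"; the alert comes from the table's own history.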
4. Contracts Shift Failures Left
A data contract is a formal agreement on a dataset's schema and semantics between producer and consumer. It turns silent runtime breakages ("why is this column suddenly null?") into loud, early, deploy-time violations — catching the break before it ever reaches a pipeline.
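A contract check can be as small as comparing a producer's proposed schema against the agreed one before the change ships. The sketch below assumes a contract expressed as a plain column-to-type mapping; `CONTRACT`, `contract_violations`, and the column names are hypothetical, and real contracts usually also cover semantics, nullability, and ownership.

```python
# Hypothetical contract agreed between producer and consumer.
CONTRACT = {"order_id": int, "customer_id": str, "amount": float}


def contract_violations(contract, proposed_schema):
    """List every contracted column that is missing or has changed type."""
    violations = []
    for column, expected in contract.items():
        if column not in proposed_schema:
            violations.append(f"missing column: {column}")
        elif proposed_schema[column] is not expected:
            actual = proposed_schema[column]
            violations.append(
                f"{column}: expected {expected.__name__}, got {actual.__name__}"
            )
    # Extra columns are allowed in this sketch: additive changes do not break consumers.
    return violations


# The producer renames amount to amount_usd and changes order_id to a string.
proposed = {"order_id": str, "customer_id": str, "amount_usd": float}
violations = contract_violations(CONTRACT, proposed)
```

Run in the producer's deploy pipeline, a non-empty `violations` list would fail the deploy, which is the "shift left" the section describes: the consumer never sees the null column.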
Where Quality Gates Sit
Tests gate each stage — bad data is stopped and alerted on, never silently published.
Same Pattern, Every Platform
| The Pattern | Snowflake | Databricks | BigQuery | Microsoft Fabric |
|---|---|---|---|---|
| In-pipeline tests | dbt tests | dbt tests / DLT expectations | dbt tests | dbt / SQL checks in pipelines |
| Quality frameworks | Great Expectations / Soda | Great Expectations / Soda | Great Expectations / Soda | Great Expectations / Soda |
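| The six checks | Identical questions; only the syntax changes | Identical questions; only the syntax changes | Identical questions; only the syntax changes | Identical questions; only the syntax changes |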
Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.
The takeaway: the six quality questions — freshness, volume, nulls, uniqueness, referential integrity, distribution — are platform-independent. Wire them into the pipeline as blocking tests on any engine and you've built trust. The tool is the product; the six checks are the pattern.