Orchestration & CDC
How data stays fresh without anyone babysitting it — by running the right steps in the right order, and by moving only what actually changed.
Two problems, two patterns — both about keeping data fresh, automatically.
- Orchestration — run the right steps, in the right order, on schedule. Model it as a DAG (a flow of dependent tasks).
- If a step fails, stop and alert; retries, schedules, and backfills are automated.
- CDC (Change Data Capture) — move only what changed. Read the source database's transaction log.
- Capture & apply inserts/updates/deletes as they happen — no full reloads.
- Together they keep the warehouse fresh, cheaply, hands-off.
Keeping Data Fresh, Automatically
A modern platform isn't a one-time load — it's a living system that must refresh continuously and reliably. Two patterns make that happen without a human in the loop. Orchestration answers "what runs, in what order, and what happens when something breaks?" Change Data Capture answers "how do we move new data efficiently instead of reloading everything?"
The Two Patterns
1. Orchestration as a DAG
Pipelines are modeled as a DAG (Directed Acyclic Graph): a set of tasks connected by dependencies, with no cycles. For example: build the staging tables, then the dimensions, then the facts, then refresh the dashboard. The orchestrator runs tasks in dependency order, running tasks that do not depend on each other in parallel, and if a task fails it stops the dependents and alerts you instead of silently producing broken downstream data.
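The dependency-order and stop-on-failure behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular orchestrator is implemented: it assumes Python 3.9+ for the standard-library `graphlib` module, runs tasks sequentially rather than in parallel, and uses a hypothetical `run_dag` helper and a `print` call as a stand-in for real alerting.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run tasks in dependency order; skip anything downstream of a failure.

    tasks: dict of task name -> zero-argument callable
    deps:  dict of task name -> set of task names it depends on
    Returns a dict of task name -> "success" | "failed" | "skipped".
    """
    status = {}
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, ())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception as exc:
            status[name] = "failed"
            print(f"ALERT: task {name!r} failed: {exc}")  # stand-in for a real alert
    return status


def failing_task():
    raise RuntimeError("dimension build failed")


ran = []  # records which tasks actually executed
tasks = {
    "staging": lambda: ran.append("staging"),
    "dimensions": failing_task,
    "facts": lambda: ran.append("facts"),
    "dashboard": lambda: ran.append("dashboard"),
}
deps = {
    "staging": set(),
    "dimensions": {"staging"},
    "facts": {"dimensions"},
    "dashboard": {"facts"},
}
result = run_dag(tasks, deps)
```

Here the failure in `dimensions` causes `facts` and `dashboard` to be skipped rather than run against incomplete inputs, which is the property that prevents broken data from reaching the dashboard.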
2. Schedules, Retries, and Backfills
Orchestrators add the operational muscle: run on a schedule (or trigger on an event), retry transient failures automatically, and backfill — re-run a pipeline across a historical date range when you add a new column or fix a bug. This is what turns a fragile script into a dependable production system.
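As a rough sketch of what retries and backfills amount to, the Python below wraps a task in a retry loop with exponential backoff and then re-runs a date-partitioned task over a historical range. The helper names (`run_with_retries`, `backfill`, `load_partition`) are hypothetical, and the sketch retries on any exception, whereas a production orchestrator would typically let you configure which failures are retried.

```python
import time
from datetime import date, timedelta


def run_with_retries(task, max_attempts=3, base_delay=0.0):
    """Call task(); on an exception, retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))


def backfill(run_for_date, start, end):
    """Re-run a date-partitioned task for every day in [start, end]."""
    results = {}
    day = start
    while day <= end:
        # Bind the current day as a default argument so each call sees its own date.
        results[day] = run_with_retries(lambda d=day: run_for_date(d))
        day += timedelta(days=1)
    return results


attempts = {}  # number of attempts made per partition date


def load_partition(day):
    """Hypothetical task that fails once per date, then succeeds."""
    attempts[day] = attempts.get(day, 0) + 1
    if attempts[day] == 1:
        raise ConnectionError("transient failure")
    return f"loaded {day.isoformat()}"


out = backfill(load_partition, date(2024, 1, 1), date(2024, 1, 3))
```

Each of the three dates fails once and succeeds on the second attempt, so the backfill completes without manual intervention. Keeping the task parameterized by date is what makes the same code usable for both the nightly run and the historical re-run.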
3. CDC — Move Only What Changed
Reloading a 500-million-row table every night to capture a few thousand changes is wasteful and slow. Change Data Capture reads the database's transaction log (the same log the database uses for replication, such as the PostgreSQL write-ahead log or the MySQL binlog) and streams just the inserts, updates, and deletes downstream. The volume moved scales with the number of changes rather than the size of the table, so the result is near-real-time freshness at a fraction of the cost, with minimal load on the source system.
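The apply side of this pattern can be illustrated with a small Python sketch that replays change events onto a target table, here modeled as a dict keyed by primary key. The event fields (`lsn`, `op`, `key`, `row`) are assumptions chosen for illustration; real CDC tools emit richer event envelopes with their own field names. The sketch also tracks the last applied log position so that replaying a batch does not apply the same change twice.

```python
def apply_changes(target, events, last_applied=0):
    """Apply log-ordered change events to target (a dict keyed by primary key).

    Each event is a dict with hypothetical fields:
      lsn: position in the source log (monotonically increasing)
      op:  "insert" | "update" | "delete"
      key: primary key of the affected row
      row: full row image after the change (absent for deletes)
    Returns the highest log position applied, to store as the next checkpoint.
    """
    for ev in sorted(events, key=lambda e: e["lsn"]):
        if ev["lsn"] <= last_applied:
            continue  # already applied earlier; skipping keeps replays idempotent
        if ev["op"] == "delete":
            target.pop(ev["key"], None)
        elif ev["op"] in ("insert", "update"):
            target[ev["key"]] = ev["row"]  # upsert: insert or overwrite
        else:
            raise ValueError(f"unknown operation: {ev['op']!r}")
        last_applied = ev["lsn"]
    return last_applied


target = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
events = [
    {"lsn": 101, "op": "insert", "key": 3, "row": {"name": "Cy"}},
    {"lsn": 102, "op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"lsn": 103, "op": "delete", "key": 2},
]

checkpoint = apply_changes(target, events, last_applied=100)

# Replaying the same batch from the stored checkpoint changes nothing.
replayed = apply_changes(target, events, last_applied=checkpoint)
```

Only three events are processed regardless of how large the target is, which is the cost advantage over a full reload. In a warehouse the same upsert-and-delete logic is usually expressed as a `MERGE` statement.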
4. Why They Belong Together
CDC delivers fresh changes; orchestration sequences the transformations that turn those changes into trustworthy, modeled data. One keeps the raw inputs current; the other keeps the downstream tables correct. Together they are the "freshness engine" of the platform.
The Freshness Engine, Visualized
Figure: CDC feeds only changes; the orchestrator (top) schedules each stage and stops the flow if a step fails.
Same Pattern, Every Platform
| The Pattern | Open-source | Snowflake | Databricks | Microsoft Fabric |
|---|---|---|---|---|
| Orchestration | Airflow · Dagster · Prefect | Tasks / dbt Cloud | Jobs / Workflows | Data Factory pipelines |
| CDC ingest | Debezium | Streams / partner connectors | Delta Live Tables / Auto Loader | Data Factory / partner CDC |
| Apply changes | MERGE / upsert | MERGE / Streams & Tasks | MERGE INTO | MERGE / Dataflows Gen2 |
Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.
The takeaway: "orchestrate a DAG" and "capture changes from the log" are the same two patterns whether you run Airflow + Debezium yourself or click through Fabric Data Factory. Learn the patterns and the freshness story is the same on every platform.