The Modern Data Ecosystem — Part 2

Data Engineer: The Builder

The role responsible for designing, building, and operating the pipelines and infrastructure that make data reliable, accessible, and cost-effective at scale.

What a Data Engineer Actually Does

A Data Engineer's core output is reliable data delivery. If a Data Scientist needs a clean, well-structured dataset to train a model, or a BI dashboard needs fresh sales figures every morning, it is the Data Engineer who built the system that makes that happen — and who gets paged when it breaks.

The role sits between source systems (operational databases, APIs, event streams) and the consuming layers (analytics, ML, BI). It is fundamentally an infrastructure engineering role that requires both software engineering discipline and deep knowledge of data semantics.

Core Concepts

1. ETL vs. ELT — The Paradigm Shift

Traditional ETL (Extract, Transform, Load) transformed data before loading it into a warehouse. With cloud platforms that scale compute elastically and on demand (Snowflake, BigQuery, Databricks), it became cheaper and more flexible to load raw data first and transform it inside the warehouse. This is ELT (Extract, Load, Transform).

ETL — Transform First

Transformation happens outside the warehouse
Only clean data lands in the target
Hard to reprocess historical data
Common with on-premise warehouses
Tools: SSIS, Informatica, Talend

ELT — Load First

Raw data lands in a Bronze layer
Transformations run inside the warehouse
Easy to reprocess with new logic
Standard pattern with cloud warehouses
Tools: dbt, Spark, SQL on Snowflake/BQ
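The ELT side of this comparison can be sketched in a few lines. This is a minimal illustration, assuming an in-memory SQLite database as a stand-in for a cloud warehouse; the table names (bronze_orders, daily_revenue) and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw export from a source system; values arrive as untyped strings.
RAW_ORDERS = [
    ("1001", "2024-05-01", "19.99"),
    ("1002", "2024-05-01", "not_a_number"),  # bad record, still kept in the raw layer
    ("1003", "2024-05-02", "5.00"),
]


def run_elt(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Load raw rows untouched, then transform them with SQL inside the database."""
    # Extract + Load: no cleaning on the way in.
    conn.execute(
        "CREATE TABLE bronze_orders (order_id TEXT, order_date TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO bronze_orders VALUES (?, ?, ?)", RAW_ORDERS)

    # Transform: runs where the data lives, so it can be re-run with new logic
    # because the raw rows are still available.
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM bronze_orders
        WHERE amount GLOB '[0-9]*.[0-9]*'
        GROUP BY order_date
        """
    )
    return conn.execute(
        "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    ).fetchall()


if __name__ == "__main__":
    print(run_elt(sqlite3.connect(":memory:")))
```

Because the malformed row stays in bronze_orders, a later change to the cleaning rule only requires rebuilding daily_revenue, not re-extracting from the source.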

2. Batch Processing vs. Real-Time Streaming

Batch processing is the traditional model: collect data over a period, then process the whole batch at once (nightly runs, hourly jobs). It is simple, cost-effective, and sufficient for most analytical use cases. Streaming processes each event as it arrives, which is essential for fraud detection, real-time recommendations, IoT monitoring, and any other use case where a delay of minutes is unacceptable.
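The difference between the two models is easiest to see side by side. The following is a rough sketch in plain Python, not tied to any streaming framework; the event shape and the spending limit are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical events: (user_id, purchase_amount)
EVENTS = [("alice", 40.0), ("bob", 15.0), ("alice", 70.0), ("bob", 5.0)]


def batch_totals(events: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Batch: wait for the whole period's data, then compute the result once."""
    totals: dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def stream_alerts(events: Iterable[tuple[str, float]], limit: float) -> Iterator[str]:
    """Streaming: update running state per event and react immediately."""
    running: dict[str, float] = defaultdict(float)
    for user, amount in events:
        running[user] += amount
        if running[user] > limit:
            yield f"{user} exceeded {limit:.0f} (running total {running[user]:.0f})"


if __name__ == "__main__":
    print(batch_totals(EVENTS))
    print(list(stream_alerts(EVENTS, limit=100)))
```

The batch function answers "what happened yesterday?", while the generator raises its alert on the third event, before the remaining data has even arrived.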

Apache Kafka has become the dominant streaming backbone: a distributed, fault-tolerant, append-only event log that decouples producers from consumers. Amazon Kinesis and Google Cloud Pub/Sub offer managed alternatives in cloud environments. The modern pattern is often a Lambda Architecture (separate batch and streaming paths) or, increasingly, a Kappa Architecture (a single streaming path that handles both real-time processing and reprocessing).
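To make the decoupling idea concrete, here is a toy in-memory model of a single log with per-consumer offsets. It is only a sketch of the concept: the EventLog class and its methods are hypothetical and are not the Kafka client API.

```python
class EventLog:
    """In-memory stand-in for one topic partition: an append-only list."""

    def __init__(self) -> None:
        self._events: list[str] = []
        self._offsets: dict[str, int] = {}  # next offset to read, per consumer group

    def publish(self, event: str) -> int:
        """Append an event and return its offset; producers never wait on consumers."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group: str, max_events: int = 10) -> list[str]:
        """Return the next events for this group and advance only its own offset."""
        start = self._offsets.get(group, 0)
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip ahead) so a group can reprocess retained events."""
        self._offsets[group] = offset


if __name__ == "__main__":
    log = EventLog()
    for name in ["login", "purchase", "logout"]:
        log.publish(name)
    print(log.poll("dashboard"))            # fast consumer reads everything
    print(log.poll("fraud", max_events=1))  # slow consumer reads at its own pace
    log.seek("dashboard", 0)                # replay from the start
    print(log.poll("dashboard"))
```

The seek call is the essence of Kappa-style reprocessing: because events are retained in the log, new logic can be applied by replaying the stream rather than by maintaining a separate batch path.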

3. Data Quality & Observability

A pipeline that runs reliably but delivers wrong data is worse than a pipeline that fails visibly. Data quality means asserting contracts on your data: row counts are within expected range, null rates are below a threshold, referential integrity between tables is maintained, value distributions match historical patterns.

Data observability is the broader practice of having visibility into the health of your entire data platform in near-real time. It is analogous to application monitoring (Prometheus/Grafana), but for data assets. Tools like Great Expectations and Soda Core embed quality checks directly into pipelines, while platforms such as Monte Carlo monitor data assets continuously from the outside. dbt tests (singular and generic) are the most common entry point for SQL-based quality assertions.
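The assertions listed above (row counts, null rates, referential integrity) can be written as plain functions. This is a minimal sketch that does not use any of the named tools; the column names, default thresholds, and the quality_failures helper are all hypothetical.

```python
def null_rate(rows: list[dict], column: str) -> float:
    """Share of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def quality_failures(
    orders: list[dict],
    customers: list[dict],
    row_range: tuple[int, int] = (1, 1000),
    max_null_rate: float = 0.1,
) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []

    # Row count within the expected range.
    low, high = row_range
    if not low <= len(orders) <= high:
        failures.append(f"row count {len(orders)} outside [{low}, {high}]")

    # Null rate below the threshold.
    rate = null_rate(orders, "amount")
    if rate > max_null_rate:
        failures.append(f"amount null rate {rate:.2f} above {max_null_rate:.2f}")

    # Referential integrity: every order points at a known customer.
    known = {c["customer_id"] for c in customers}
    orphans = [o["order_id"] for o in orders if o["customer_id"] not in known]
    if orphans:
        failures.append(f"orders with unknown customer: {orphans}")

    return failures
```

In a pipeline, a non-empty result would typically fail the run or block publication of the table, turning a silent data error into a visible one.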

4. Data Contracts

A Data Contract is a formal, machine-readable agreement between a data producer (a microservice, an operational system) and a data consumer (a pipeline, a dashboard, an ML model) about the structure, semantics, and SLAs of a dataset. Think of it as an API contract applied to data.

Without contracts, schema changes in an upstream system silently break downstream pipelines — often discovered only when a dashboard shows zeros or an ML model starts producing nonsense. Data Contracts, often defined in YAML using frameworks like Data Contract CLI or Soda, make these agreements explicit and enforceable, shifting schema breaks from silent runtime failures to detectable deployment-time violations.
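A deployment-time contract check might look roughly like the sketch below. The contract is shown as a Python dict for brevity (in practice it would usually be loaded from a YAML file); the field names and the contract_violations function are hypothetical and do not reflect the format of any specific framework.

```python
# Hypothetical contract: the columns and types that consumers rely on.
CONTRACT = {
    "dataset": "orders",
    "columns": {
        "order_id": "string",
        "amount": "decimal",
        "created_at": "timestamp",
    },
}


def contract_violations(contract: dict, producer_schema: dict[str, str]) -> list[str]:
    """Compare the producer's proposed schema against the agreed contract."""
    violations = []
    for name, expected_type in contract["columns"].items():
        actual_type = producer_schema.get(name)
        if actual_type is None:
            violations.append(f"missing column: {name}")
        elif actual_type != expected_type:
            violations.append(f"type changed: {name} {expected_type} -> {actual_type}")
    # Assumption: extra columns are treated as additive, non-breaking changes.
    return violations


if __name__ == "__main__":
    proposed = {"order_id": "string", "amount": "float", "channel": "string"}
    for problem in contract_violations(CONTRACT, proposed):
        print(problem)
```

Run in the producer's CI pipeline, a check like this fails the build when a column is dropped or retyped, which is the shift from runtime breakage to deployment-time detection described above.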

5. FinOps for Data Teams

Cloud warehouses charge for compute and storage. A poorly tuned dbt model that scans a full 10 TB table on every run can cost thousands of dollars per month, whether that is billed as bytes scanned (BigQuery on-demand pricing) or as warehouse credits (Snowflake). FinOps for Data means treating cloud spend as a first-class engineering concern: partitioning tables by date, clustering by frequently filtered columns, caching intermediate results, right-sizing Databricks clusters, and setting query cost alerts.
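A quick back-of-the-envelope calculation shows why partitioning matters. The sketch below assumes a bytes-scanned pricing model; the price per terabyte, the daily schedule, and the 365-partition layout are illustrative assumptions, not quoted rates.

```python
def monthly_scan_cost(
    tb_scanned_per_run: float,
    runs_per_day: int,
    price_per_tb: float,
    days: int = 30,
) -> float:
    """Monthly cost when billing is proportional to the data scanned."""
    return tb_scanned_per_run * runs_per_day * days * price_per_tb


# Assumed, illustrative price; check your provider's current rates.
PRICE_PER_TB = 6.25

# Full scan: the model reads the entire 10 TB table once a day.
full_scan = monthly_scan_cost(10.0, runs_per_day=1, price_per_tb=PRICE_PER_TB)

# Partition pruning: the table has 365 daily partitions of equal size
# and the model reads only the newest one.
pruned_scan = monthly_scan_cost(10.0 / 365, runs_per_day=1, price_per_tb=PRICE_PER_TB)

if __name__ == "__main__":
    print(f"full scan:   ${full_scan:,.2f} per month")
    print(f"pruned scan: ${pruned_scan:,.2f} per month")
```

Under these assumptions a single daily run already costs $1,875 per month for the full scan; an hourly schedule would multiply that by 24, while the pruned version stays at a few dollars.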

Modern Real-Time + Batch Pipeline Architecture

The following diagram shows how a production data pipeline integrates streaming and batch ingestion, processing, and the Medallion storage pattern:

flowchart LR
    subgraph SRC["Data Sources"]
        DB[("RDBMS\nPostgres · Oracle")]
        API["REST APIs\nWebhooks"]
        IOT["IoT / Events\nSensors · Logs"]
    end

    subgraph INGEST["Ingestion Layer"]
        KAFKA["Apache Kafka\nReal-Time Streams"]
        KINESIS["AWS Kinesis\nCloud Events"]
        BATCH["Batch Jobs\nS3 · SFTP · Files"]
    end

    subgraph PROC["Processing & Orchestration"]
        SPARK["PySpark\nDatabricks"]
        DBT["dbt\nTransformations"]
        AF["Apache Airflow\nOrchestration"]
    end

    subgraph STORE["Medallion Storage"]
        B[("Bronze\nRaw Data")]
        S[("Silver\nCleaned Data")]
        G[("Gold\nBusiness-Ready")]
    end

    subgraph SERVE["Serving Layer"]
        SF["Snowflake\nBigQuery"]
        BI["BI Dashboards\nTableau · Looker"]
        ML["ML Training\nFeature Store"]
    end

    DB --> KAFKA
    API --> KAFKA
    IOT --> KINESIS
    DB --> BATCH
    KAFKA --> SPARK
    KINESIS --> SPARK
    BATCH --> SPARK
    SPARK --> B
    B --> DBT
    DBT --> S
    S --> DBT
    DBT --> G
    AF -.->|"schedules"| SPARK
    AF -.->|"schedules"| DBT
    G --> SF
    SF --> BI
    SF --> ML

A modern pipeline: Kafka and Kinesis handle real-time streams; batch jobs handle file-based sources; PySpark lands the data in Bronze; dbt transforms it into Silver and Gold; Airflow schedules the Spark and dbt jobs.

The Tool Stack, Explained

Apache Airflow

Python-based workflow orchestrator. Define DAGs (Directed Acyclic Graphs) to schedule and monitor pipelines. The de facto standard for batch orchestration. Alternatives: Prefect, Dagster.
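The core idea behind a DAG is that each task declares what it depends on and the orchestrator derives a valid run order. The sketch below illustrates this with Python's standard-library graphlib rather than Airflow itself; the task names are hypothetical, and none of this is Airflow's API.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
PIPELINE = {
    "ingest_to_bronze": set(),
    "bronze_to_silver": {"ingest_to_bronze"},
    "silver_to_gold": {"bronze_to_silver"},
    "refresh_dashboard": {"silver_to_gold"},
    "train_model": {"silver_to_gold"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag: dict[str, set[str]]) -> bool:
    """A workflow is only schedulable if its dependency graph has no cycles."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

A real orchestrator adds scheduling, retries, and monitoring on top, but the dependency resolution shown here is why the graph must be acyclic: a cycle has no valid starting task.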

Apache Spark / PySpark

Distributed in-memory processing engine. Handles datasets too large for a single machine. Essential for batch transformation at scale. Managed via Databricks or EMR. PySpark = Python API for Spark.
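The processing model can be approximated on one machine: split the data into partitions, compute a partial result per partition, then merge the partials. This is a conceptual sketch in plain Python, not the PySpark API; the function names are hypothetical, and in a real cluster each partition would be processed on a different worker.

```python
from collections import Counter
from functools import reduce


def partition(records: list[str], n_partitions: int) -> list[list[str]]:
    """Split records round-robin into n_partitions chunks."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def count_partition(records: list[str]) -> Counter:
    """Work done independently per partition (on a separate worker in a cluster)."""
    return Counter(records)


def distributed_count(records: list[str], n_partitions: int) -> Counter:
    """Combine the partial counts into the final result."""
    partials = map(count_partition, partition(records, n_partitions))
    return reduce(lambda left, right: left + right, partials, Counter())


if __name__ == "__main__":
    clicks = ["home", "cart", "home", "checkout", "home"]
    print(distributed_count(clicks, n_partitions=2))
```

The result is the same regardless of how many partitions are used, which is what allows the work to be spread across as many machines as the data requires.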

dbt (data build tool)

SQL-based transformation framework. Write SELECT statements; dbt handles the CREATE TABLE/VIEW, incremental logic, documentation, lineage graphs, and quality tests. Runs inside your warehouse.
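The "incremental logic" mentioned above boils down to processing only the rows that are newer than what the target table already holds. The sketch below mimics that idea with SQLite and plain SQL; it is not dbt code, and the table names (raw_events, fct_events) and the loaded_at column are hypothetical.

```python
import sqlite3


def incremental_run(conn: sqlite3.Connection) -> int:
    """Append only source rows newer than the newest row already in the target.

    Returns the number of rows added. Timestamps are ISO-8601 strings,
    which sort correctly as plain text.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fct_events (event_id INTEGER, loaded_at TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM fct_events").fetchone()[0]
    conn.execute(
        """
        INSERT INTO fct_events (event_id, loaded_at)
        SELECT event_id, loaded_at
        FROM raw_events
        WHERE loaded_at > (SELECT COALESCE(MAX(loaded_at), '') FROM fct_events)
        """
    )
    after = conn.execute("SELECT COUNT(*) FROM fct_events").fetchone()[0]
    return after - before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (event_id INTEGER, loaded_at TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [(1, "2024-05-01T00:00:00"), (2, "2024-05-01T06:00:00")],
    )
    print(incremental_run(conn))  # first run loads everything
    print(incremental_run(conn))  # second run finds nothing new
```

Compared with rebuilding the table on every run, this pattern scans and writes far less data, at the cost of having to handle late-arriving or updated rows explicitly.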

Apache Kafka

Distributed, fault-tolerant event streaming platform. Producers publish events; consumers read at their own pace. Retention is configurable (hours to forever). Used as the backbone of real-time architectures.

Snowflake

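Cloud data warehouse with separated storage and compute. Supports semi-structured data (JSON), time travel, and data sharing across organizations. Cost model: compute is billed per second while a virtual warehouse is running (with a 60-second minimum each time it starts or resumes), and storage is billed separately.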

Google BigQuery

Serverless, columnar cloud warehouse: there are no clusters to provision or manage. Excellent for large-scale ad-hoc analysis. Integrates natively with GCP services and Vertex AI.

Databricks

Unified analytics platform built on Apache Spark. Combines batch and streaming processing, ML model training, and Delta Lake (open table format with ACID transactions).

Docker & Terraform

Docker containerizes pipeline code for reproducible execution. Terraform provisions cloud infrastructure as code — warehouses, buckets, Kafka clusters, IAM roles — in a version-controlled, repeatable way.

The Medallion Architecture (Bronze → Silver → Gold) is the most widely adopted storage pattern for modern data platforms. It is covered in depth in Part 3 (Data Architect), but the Data Engineer is responsible for implementing the pipelines that populate each layer. Understanding the pattern is essential for both roles.
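From the engineer's side, populating the three layers amounts to a chain of increasingly opinionated transformations. The following is a small sketch in plain Python, assuming hypothetical order records and a hypothetical "revenue by country" Gold table; in production these steps would typically be Spark jobs or dbt models.

```python
# Bronze: raw records exactly as delivered, including duplicates and bad values.
BRONZE = [
    {"order_id": "1", "country": "de", "amount": "10.50"},
    {"order_id": "1", "country": "de", "amount": "10.50"},  # duplicate delivery
    {"order_id": "2", "country": "FR", "amount": "oops"},   # unparseable amount
    {"order_id": "3", "country": "DE", "amount": "4.50"},
]


def to_silver(bronze: list[dict]) -> list[dict]:
    """Silver: deduplicated, typed, and standardized records."""
    seen: set[str] = set()
    silver = []
    for row in bronze:
        if row["order_id"] in seen:
            continue  # drop repeated deliveries of the same order
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that cannot be typed; a real job would quarantine them
        seen.add(row["order_id"])
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"].upper(),
                "amount": amount,
            }
        )
    return silver


def to_gold(silver: list[dict]) -> dict[str, float]:
    """Gold: a business-ready aggregate, here revenue per country."""
    revenue: dict[str, float] = {}
    for row in silver:
        revenue[row["country"]] = revenue.get(row["country"], 0.0) + row["amount"]
    return revenue


if __name__ == "__main__":
    print(to_gold(to_silver(BRONZE)))
```

Keeping Bronze untouched means the Silver and Gold rules can change later and be replayed over the full history.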

Key Skills for Data Engineers in 2025

Advanced SQL (window functions, CTEs, partitioning)
Python for pipeline scripting and testing
PySpark for distributed processing
Airflow / Prefect DAG authoring
dbt models, tests, and documentation
Data Contract design and enforcement

Snowflake or BigQuery administration
Kafka producer/consumer patterns
Docker and container orchestration basics
Terraform for IaC provisioning
Data quality frameworks (Great Expectations, Soda)
Cloud cost monitoring and query optimization
