Data Engineer: The Builder
The role responsible for designing, building, and operating the pipelines and infrastructure that make data reliable, accessible, and cost-effective at scale.
What a Data Engineer Actually Does
A Data Engineer's core output is reliable data delivery. If a Data Scientist needs a clean, well-structured dataset to train a model, or a BI dashboard needs fresh sales figures every morning, it is the Data Engineer who built the system that makes that happen — and who gets paged when it breaks.
The role sits between source systems (operational databases, APIs, event streams) and the consuming layers (analytics, ML, BI). It is fundamentally an infrastructure engineering role that requires both software engineering discipline and deep knowledge of data semantics.
Core Concepts
1. ETL vs. ELT — The Paradigm Shift
Traditional ETL (Extract, Transform, Load) transforms data before loading it into a warehouse. Once cloud platforms with elastic, on-demand compute became common (Snowflake, BigQuery, Databricks), it became cheaper and more flexible to load raw data first and transform it inside the warehouse. This is ELT (Extract, Load, Transform).
ETL — Transform First
- Transformation happens outside the warehouse
- Only clean data lands in the target
- Hard to reprocess historical data, because the raw input is often not retained
- Common with on-premise warehouses
- Tools: SSIS, Informatica, Talend
ELT — Load First
- Raw data lands in a Bronze layer
- Transformations run inside the warehouse
- Easy to reprocess with new logic
- Standard pattern with cloud warehouses
- Tools: dbt, Spark, SQL on Snowflake/BigQuery
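The ELT flow can be sketched with Python's built-in `sqlite3` module standing in for a cloud warehouse. This is a minimal illustration under stated assumptions: the table names `bronze_orders` and `silver_orders`, the sample rows, and the cleaning rule are all hypothetical, and a real warehouse would use its own SQL dialect and loading tools.

```python
import sqlite3

# Hypothetical raw rows as extracted from a source system: untyped text,
# one malformed amount, and one duplicated order.
RAW_ORDERS = [
    ("o-1", "2025-01-03", "19.50"),
    ("o-2", "2025-01-03", "n/a"),
    ("o-3", "2025-01-04", "5.00"),
    ("o-3", "2025-01-04", "5.00"),
]


def run_elt(rows):
    """Load raw rows untouched, then transform them with SQL inside the engine."""
    con = sqlite3.connect(":memory:")

    # Load: land the data exactly as received (the Bronze layer).
    con.execute(
        "CREATE TABLE bronze_orders (order_id TEXT, order_date TEXT, amount TEXT)"
    )
    con.executemany("INSERT INTO bronze_orders VALUES (?, ?, ?)", rows)

    # Transform: filter bad values, cast types, and deduplicate in SQL.
    con.execute(
        """
        CREATE TABLE silver_orders AS
        SELECT DISTINCT order_id, order_date, CAST(amount AS REAL) AS amount
        FROM bronze_orders
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )
    return con


con = run_elt(RAW_ORDERS)
bronze_count = con.execute("SELECT COUNT(*) FROM bronze_orders").fetchone()[0]
silver_rows = con.execute(
    "SELECT order_id, amount FROM silver_orders ORDER BY order_id"
).fetchall()
print(bronze_count, silver_rows)
```

Because the raw rows stay in `bronze_orders`, changing the cleaning rule only requires rebuilding `silver_orders` from data already loaded, which is what makes reprocessing easy in this pattern.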
2. Batch Processing vs. Real-Time Streaming
Batch processing is the traditional model: collect data over a period, then process the whole batch at once (nightly runs, hourly jobs). It is simple, cost-effective, and sufficient for most analytical use cases. Streaming processes each event as it arrives, which is essential for fraud detection, real-time recommendations, IoT monitoring, and any use case where a latency of minutes is unacceptable.
Apache Kafka has become the dominant streaming backbone: a distributed, fault-tolerant event log that decouples producers from consumers. Unlike a traditional message queue, it retains events after they are read, so consumers can replay them. Amazon Kinesis and Google Cloud Pub/Sub offer managed alternatives in cloud environments. The modern pattern is often a Lambda Architecture (separate batch and streaming paths) or, increasingly, a Kappa Architecture (a single streaming path that handles both real-time processing and reprocessing).
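The way a streaming backbone decouples producers from consumers can be illustrated with a toy in-memory log. This sketch is not the Kafka client API: `EventLog`, `publish`, `poll`, and `rewind` are hypothetical names chosen for illustration, and the real system adds partitioning, replication, and durable storage.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only log with one read offset per consumer."""

    def __init__(self):
        self._events = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def publish(self, event):
        # Producers only append; they know nothing about who reads.
        self._events.append(event)

    def poll(self, consumer, max_events=10):
        # Each consumer advances its own offset, so readers never block each other.
        start = self._offsets[consumer]
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer, offset=0):
        # Events are retained after being read, so a consumer can reprocess them.
        self._offsets[consumer] = offset


log = EventLog()
for amount in (20, 35, 900):
    log.publish({"type": "payment", "amount": amount})

# A low-latency consumer reacts to events as soon as it polls them.
fraud_alerts = [e for e in log.poll("fraud-check") if e["amount"] > 500]

# A slower consumer reads the same events in smaller batches, at its own pace.
first_batch = log.poll("nightly-report", max_events=2)
second_batch = log.poll("nightly-report", max_events=2)

# Reprocessing: rewind and read the full history again.
log.rewind("fraud-check")
replayed = log.poll("fraud-check")
```

The `rewind` step is the idea behind reprocessing in a single streaming path: new logic is applied by replaying retained events rather than by running a separate batch job.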
3. Data Quality & Observability
A pipeline that runs reliably but delivers wrong data is worse than a pipeline that fails visibly. Data quality means asserting contracts on your data: row counts are within expected range, null rates are below a threshold, referential integrity between tables is maintained, value distributions match historical patterns.
Data observability is the broader practice of having visibility into the health of your entire data platform in near-real time, analogous to application monitoring (Prometheus/Grafana) but for data assets. Libraries like Great Expectations and Soda Core embed quality checks directly into pipelines, while platforms like Monte Carlo monitor data assets continuously. dbt tests (singular and generic) are the most common entry point for SQL-based quality assertions.
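The kinds of assertions listed above (row counts, null rates, referential integrity) can be written as small functions. The following is a plain-Python sketch, not the API of Great Expectations, Soda, or dbt; the function names, sample tables, and thresholds are assumptions made for illustration.

```python
def check_row_count(rows, minimum, maximum):
    """True when the number of rows falls inside the expected range."""
    return minimum <= len(rows) <= maximum


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def orphaned_keys(child_rows, foreign_key, parent_rows, primary_key):
    """Foreign-key values in the child table with no matching parent row."""
    parents = {row[primary_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parents)


# Hypothetical sample data: one null amount and one order for an unknown customer.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 10, "customer_id": 1, "amount": 30.0},
    {"id": 11, "customer_id": 2, "amount": None},
    {"id": 12, "customer_id": 3, "amount": 12.5},
    {"id": 13, "customer_id": 1, "amount": 8.0},
]

results = {
    "row_count_ok": check_row_count(orders, minimum=1, maximum=1000),
    "amount_null_rate_ok": null_rate(orders, "amount") <= 0.10,
    "orphans": orphaned_keys(orders, "customer_id", customers, "id"),
}
print(results)
```

In a pipeline, a failed check would typically stop the run or raise an alert, so that wrong data fails visibly instead of flowing downstream.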
4. Data Contracts
A Data Contract is a formal, machine-readable agreement between a data producer (a microservice, an operational system) and a data consumer (a pipeline, a dashboard, an ML model) about the structure, semantics, and SLAs of a dataset. Think of it as an API contract applied to data.
Without contracts, schema changes in an upstream system silently break downstream pipelines — often discovered only when a dashboard shows zeros or an ML model starts producing nonsense. Data Contracts, often defined in YAML using frameworks like Data Contract CLI or Soda, make these agreements explicit and enforceable, shifting schema breaks from silent runtime failures to detectable deployment-time violations.
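To make the idea concrete, here is a minimal sketch of a contract check in plain Python. It does not reproduce the YAML format of Data Contract CLI or Soda; the `CONTRACT` structure, the `violations` function, and the field names are hypothetical and cover only structure, not semantics or SLAs.

```python
# Hypothetical contract for an "orders" dataset: field name -> type and whether required.
CONTRACT = {
    "dataset": "orders",
    "fields": {
        "order_id": {"type": str, "required": True},
        "amount": {"type": float, "required": True},
        "coupon": {"type": str, "required": False},
    },
}


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    fields = contract["fields"]
    for name, spec in fields.items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                problems.append(f"missing required field: {name}")
        elif not isinstance(value, spec["type"]):
            problems.append(
                f"{name}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about usually signal an upstream change.
    for name in sorted(record.keys() - fields.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


valid_record = {"order_id": "o-1", "amount": 19.5}
# The upstream team renamed "amount" to "total" without telling consumers.
renamed_record = {"order_id": "o-2", "total": 19.5}

print(violations(valid_record, CONTRACT))
print(violations(renamed_record, CONTRACT))
```

Running a check like this in the producer's CI pipeline is one way to surface the rename before deployment rather than in a broken dashboard.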
5. FinOps for Data Teams
Cloud warehouses charge for compute and storage. A poorly tuned dbt model that scans a full 10 TB table on every run can cost thousands of dollars per month in BigQuery on-demand query charges or Snowflake credits. FinOps for Data means treating cloud spend as a first-class engineering concern: partitioning tables by date, clustering by frequently filtered columns, caching intermediate results, right-sizing Databricks clusters, and setting query cost alerts.
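A rough back-of-the-envelope calculation shows why partitioning matters. The price per TB, the run frequency, and the partition count below are illustrative assumptions, not actual vendor pricing; substitute the rates from your own contract.

```python
def monthly_scan_cost(tb_scanned_per_run, runs_per_day, price_per_tb, days=30):
    """Estimated monthly cost of a query billed by the volume of data it scans."""
    return tb_scanned_per_run * runs_per_day * days * price_per_tb


ASSUMED_PRICE_PER_TB = 5.0  # hypothetical rate in dollars, for illustration only
TABLE_SIZE_TB = 10
RUNS_PER_DAY = 4
DAILY_PARTITIONS = 400  # assumes the table holds 400 days of evenly sized data

# Unpartitioned: every run scans the whole table.
full_scan = monthly_scan_cost(TABLE_SIZE_TB, RUNS_PER_DAY, ASSUMED_PRICE_PER_TB)

# Partitioned by date: each run only reads the most recent day's partition.
one_partition_tb = TABLE_SIZE_TB / DAILY_PARTITIONS
partitioned = monthly_scan_cost(one_partition_tb, RUNS_PER_DAY, ASSUMED_PRICE_PER_TB)

print(f"full scan:   ${full_scan:,.2f} per month")
print(f"partitioned: ${partitioned:,.2f} per month")
```

Under these assumptions the same model drops from thousands of dollars to a few dollars per month, which is why date partitioning is usually the first optimization to check.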
Modern Real-Time + Batch Pipeline Architecture
The following diagram shows how a production data pipeline integrates streaming and batch ingestion, processing, and the Medallion storage pattern:
flowchart LR
subgraph SRC["Data Sources"]
DB[("RDBMS\nPostgres · Oracle")]
API["REST APIs\nWebhooks"]
IOT["IoT / Events\nSensors · Logs"]
end
subgraph INGEST["Ingestion Layer"]
KAFKA["Apache Kafka\nReal-Time Streams"]
KINESIS["AWS Kinesis\nCloud Events"]
BATCH["Batch Jobs\nS3 · SFTP · Files"]
end
subgraph PROC["Processing & Orchestration"]
SPARK["PySpark\nDatabricks"]
DBT["dbt\nTransformations"]
AF["Apache Airflow\nOrchestration"]
end
subgraph STORE["Medallion Storage"]
B[("Bronze\nRaw Data")]
S[("Silver\nCleaned Data")]
G[("Gold\nBusiness-Ready")]
end
subgraph SERVE["Serving Layer"]
SF["Snowflake\nBigQuery"]
BI["BI Dashboards\nTableau · Looker"]
ML["ML Training\nFeature Store"]
end
DB --> KAFKA
API --> KAFKA
IOT --> KINESIS
DB --> BATCH
KAFKA --> SPARK
KINESIS --> SPARK
BATCH --> SPARK
SPARK --> B
B --> DBT
DBT --> S
S --> DBT
DBT --> G
AF -.->|"schedules"| SPARK
AF -.->|"schedules"| DBT
G --> SF
SF --> BI
SF --> ML
A modern pipeline: Kafka and Kinesis handle real-time streams; batch jobs handle files and database extracts; PySpark processes everything into Bronze; dbt transforms to Silver and Gold; Airflow orchestrates the whole flow.
The Tool Stack, Explained
Apache Airflow
Python-based workflow orchestrator. Define DAGs (Directed Acyclic Graphs) to schedule and monitor pipelines. The de-facto standard for batch orchestration. Alternatives: Prefect, Dagster.
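What an orchestrator does with a DAG can be shown using the standard-library `graphlib` module (Python 3.9+). This is not Airflow code: the task names and the `execution_order` and `is_valid_dag` helpers are hypothetical, and a real orchestrator adds scheduling, retries, and monitoring on top of the ordering logic.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "load_bronze": {"extract_orders"},
    "build_silver": {"load_bronze"},
    "build_gold": {"build_silver"},
    "run_quality_tests": {"build_silver"},
    "refresh_dashboard": {"build_gold", "run_quality_tests"},
}


def execution_order(dag):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def is_valid_dag(dag):
    """A workflow is only schedulable if its dependency graph has no cycles."""
    try:
        execution_order(dag)
    except CycleError:
        return False
    return True


order = execution_order(PIPELINE)
print(order)
```

The "acyclic" requirement is what guarantees such an order exists; a graph where two tasks depend on each other can never be scheduled.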
Apache Spark / PySpark
Distributed processing engine that keeps working data in memory where possible. Handles datasets too large for a single machine. Essential for batch transformation at scale. Typically run on managed platforms such as Databricks or Amazon EMR. PySpark is the Python API for Spark.
dbt (data build tool)
SQL-based transformation framework. Write SELECT statements; dbt handles the CREATE TABLE/VIEW, incremental logic, documentation, lineage graphs, and quality tests. Runs inside your warehouse.
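The core idea (you write a SELECT, the framework generates the DDL around it) can be imitated in a few lines. This sketch uses `sqlite3` and is not how dbt is implemented or invoked; the `MODELS` dictionary, the `materialize` function, and the sample table are assumptions for illustration.

```python
import sqlite3

# Hypothetical "models": a name mapped to the SELECT statement that defines it.
MODELS = {
    "daily_revenue": (
        "SELECT order_date, SUM(amount) AS revenue "
        "FROM orders GROUP BY order_date"
    ),
}


def materialize(con, name, select_sql, as_view=False):
    """Wrap a SELECT in the DROP/CREATE statements needed to build the model."""
    kind = "VIEW" if as_view else "TABLE"
    con.execute(f"DROP {kind} IF EXISTS {name}")
    con.execute(f"CREATE {kind} {name} AS {select_sql}")


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2025-01-03", 10.0), ("2025-01-03", 2.5), ("2025-01-04", 4.0)],
)

for model_name, sql in MODELS.items():
    materialize(con, model_name, sql)

revenue = con.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(revenue)
```

Keeping the model as a bare SELECT is the design choice that matters: the same statement can be built as a table or a view, tested, and documented without the author writing any DDL.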
Apache Kafka
Distributed, fault-tolerant event streaming platform. Producers publish events; consumers read at their own pace. Retention is configurable (hours to forever). Used as the backbone of real-time architectures.
Snowflake
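Cloud data warehouse with separated storage and compute. Supports semi-structured data (JSON), time travel, and data sharing across organizations. Cost model: compute is billed per second while a virtual warehouse is running (with a 60-second minimum each time it starts), and storage is billed separately.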
Google BigQuery
Serverless, columnar cloud warehouse: there are no clusters to provision or manage. Excellent for large-scale ad-hoc analysis. Integrates natively with GCP services and Vertex AI.
Databricks
Unified analytics platform built on Apache Spark. Combines batch and streaming processing, ML model training, and Delta Lake (open table format with ACID transactions).
Docker & Terraform
Docker containerizes pipeline code for reproducible execution. Terraform provisions cloud infrastructure as code — warehouses, buckets, Kafka clusters, IAM roles — in a version-controlled, repeatable way.
The Medallion Architecture (Bronze → Silver → Gold) is the most widely adopted storage pattern for modern data platforms. It is covered in depth in Part 3 (Data Architect), but the Data Engineer is responsible for implementing the pipelines that populate each layer. Understanding the pattern is essential for both roles.
Key Skills for Data Engineers in 2025
- Advanced SQL (window functions, CTEs, partitioning)
- Python for pipeline scripting and testing
- PySpark for distributed processing
- Airflow / Prefect DAG authoring
- dbt models, tests, and documentation
- Data Contract design and enforcement
- Snowflake or BigQuery administration
- Kafka producer/consumer patterns
- Docker and container orchestration basics
- Terraform for IaC provisioning
- Data quality frameworks (Great Expectations, Soda)
- Cloud cost monitoring and query optimization