Part 4 — Batch vs Streaming

Learn the Pattern · Part 4

Batch vs Streaming: Choosing Your Latency

Two ways to move data, and one honest question behind the choice: how fresh does this data actually need to be for the decision it feeds?

In 60 seconds

Batch vs streaming is a question about exactly one thing: latency.

Batch — collect data over a window, then process the whole chunk (nightly, hourly, every 15 min).
Streaming — process each event the moment it arrives (sub-second to seconds).
Batch is simpler & cheaper — and good enough for most reports, dashboards, and ML training.
Streaming is for "now" — fraud blocks, live alerts, recommendations, IoT monitoring.
The rule — pick latency by the DECISION it feeds, not by hype.

It's All About Latency

Every data pipeline answers a question, and that question has a tolerance for staleness. A monthly board report doesn't care if the data is twelve hours old. A fraud system blocking a transaction cares about the next 200 milliseconds. Batch and streaming are just the two ends of that latency spectrum.

Batch — process in chunks

Scheduled: nightly, hourly, every 15 min
Simple to build, test, and reason about
Cost-effective at scale
Great for reporting, BI, ML training
Latency: minutes to hours

Streaming — process per event

Continuous: handle each event as it lands
More moving parts, harder to operate
Higher cost (always-on)
Needed for fraud, alerts, live personalization
Latency: milliseconds to seconds
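The contrast above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the event shape, the sample data, and the function names `batch_totals` and `streaming_totals` are assumptions made up for the example. The batch function waits for a whole window and processes it as one chunk; the streaming function updates its state and emits a result after every single event.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical event shape: (timestamp_seconds, account_id, amount).
Event = tuple[int, str, float]

EVENTS: list[Event] = [
    (5, "a", 10.0),
    (20, "b", 4.0),
    (65, "a", 2.5),
    (70, "b", 1.0),
]


def batch_totals(events: Iterable[Event], window_seconds: int) -> dict[int, dict[str, float]]:
    """Batch: bucket events by window, then process each whole bucket at once."""
    windows: dict[int, list[Event]] = defaultdict(list)
    for event in events:
        window_start = (event[0] // window_seconds) * window_seconds
        windows[window_start].append(event)

    results: dict[int, dict[str, float]] = {}
    for window_start, chunk in sorted(windows.items()):
        totals: dict[str, float] = defaultdict(float)
        for _, account, amount in chunk:
            totals[account] += amount
        results[window_start] = dict(totals)
    return results


def streaming_totals(events: Iterable[Event]) -> Iterator[dict[str, float]]:
    """Streaming: update running state and emit a snapshot after every event."""
    totals: dict[str, float] = defaultdict(float)
    for _, account, amount in events:
        totals[account] += amount
        yield dict(totals)


per_window = batch_totals(EVENTS, window_seconds=60)
running = list(streaming_totals(EVENTS))
```

Here `per_window` holds one result per 60-second window, while `running` holds one result per event. The arithmetic is the same in both; what differs is how soon after an event a result reflecting it exists.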

What You Actually Need to Know

1. Most Questions Are Fine with Batch

The uncomfortable truth a lot of "real-time everything" content skips: the majority of business questions tolerate batch. Daily revenue, weekly cohorts, monthly churn — none of these need sub-second freshness. Reaching for streaming when batch would do adds cost and operational burden for no business gain.

2. Streaming Is a Tool, Not a Trophy

Use streaming when latency directly changes an outcome: a fraudulent charge blocked before it clears, a sensor reading that triggers a shutdown, a recommendation that must reflect the click you just made. If no decision changes because the data is seconds-fresh instead of hours-fresh, you don't need a stream.
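One way to make this rule concrete is a small helper that asks the two questions in order: does freshness change the outcome, and if so, is the tolerance tighter than a batch schedule can deliver? The sketch below is illustrative only. The `Decision` class, the `choose_mode` function, and the 15-minute cutoff are assumptions for the example (the cutoff simply reuses the tightest batch schedule mentioned earlier), not an industry standard.

```python
from dataclasses import dataclass

# Assumed cutoff: tolerances tighter than the tightest batch schedule
# discussed here (every 15 minutes) are treated as a case for streaming.
TIGHTEST_BATCH_SECONDS = 15 * 60


@dataclass(frozen=True)
class Decision:
    name: str
    max_staleness_seconds: float   # oldest data the decision can still act on
    outcome_changes_with_freshness: bool  # would fresher data change what happens?


def choose_mode(decision: Decision) -> str:
    """Pick a processing mode from the decision, not from the technology."""
    if not decision.outcome_changes_with_freshness:
        return "batch"
    if decision.max_staleness_seconds < TIGHTEST_BATCH_SECONDS:
        return "streaming"
    return "batch"


fraud_block = Decision("block fraudulent charge", 0.2, True)
board_report = Decision("monthly board report", 12 * 3600, False)
vanity_dashboard = Decision("live dashboard nobody acts on", 1.0, False)

modes = [choose_mode(d) for d in (fraud_block, board_report, vanity_dashboard)]
```

The third case is the one worth noticing: a one-second tolerance on paper still resolves to batch, because no decision changes when the data is fresher.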

3. The Streaming Backbone Is a Log

Real-time architectures are typically built on an append-only event log: producers publish events, and consumers read them at their own pace. Apache Kafka is the dominant open-source backbone; Amazon Kinesis, Google Cloud Pub/Sub, and Microsoft Fabric's Eventstream are managed counterparts. The core pattern, a durable log that decouples producers from consumers, is the same across all of them, although retention and replay behavior differ by service.
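The log pattern can be sketched in memory to show what "at their own pace" means. This is a toy stand-in, not the client API of Kafka, Kinesis, Pub/Sub, or Eventstream: the `EventLog` class, its `publish` and `poll` methods, and the consumer names are all hypothetical. The key idea is that events are only ever appended, and each consumer tracks its own read position (offset) independently.

```python
class EventLog:
    """In-memory stand-in for a durable, append-only log (not a real client API)."""

    def __init__(self) -> None:
        self._events: list[str] = []        # append-only; never edited in place
        self._offsets: dict[str, int] = {}  # next position to read, per consumer

    def publish(self, event: str) -> int:
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str, max_events: int = 10) -> list[str]:
        """Return up to max_events unread events and advance that consumer's offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for name in ["order_created", "payment_authorized", "order_shipped"]:
    log.publish(name)

fraud_first = log.poll("fraud", max_events=1)  # slow consumer reads one event
warehouse_all = log.poll("warehouse")          # fast consumer reads everything
fraud_rest = log.poll("fraud")                 # slow consumer catches up later
```

Because the producer never waits on either consumer, a slow reader does not hold up a fast one, and neither of them holds up publishing.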

Two Paths, One Platform

Both paths land in the same storage — the difference is only how quickly the data gets there.
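A rough way to see the difference in arrival time is to compute when the same event would land under each path. The sketch below is a simplification under stated assumptions: the function names, the one-hour batch interval, and the half-second streaming delay are invented for illustration, and real pipelines add queueing, retries, and job run time on top.

```python
import math


def batch_landed_at(event_time: float, interval_seconds: float, run_seconds: float = 0.0) -> float:
    """An event lands once the window containing it closes and the job finishes."""
    window_end = (math.floor(event_time / interval_seconds) + 1) * interval_seconds
    return window_end + run_seconds


def stream_landed_at(event_time: float, processing_seconds: float = 0.0) -> float:
    """An event lands as soon as it has been processed."""
    return event_time + processing_seconds


# Event times in seconds since the start of an hourly batch window.
event_times = [10.0, 1800.0, 3599.0]

batch_lag = [batch_landed_at(t, interval_seconds=3600.0) - t for t in event_times]
stream_lag = [stream_landed_at(t, processing_seconds=0.5) - t for t in event_times]
```

With these numbers, the batch lag ranges from one second to nearly a full hour depending on where in the window the event falls, while the streaming lag stays at the assumed half second for every event.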

Same Pattern, Every Platform

The Pattern	Snowflake	Databricks	BigQuery	Microsoft Fabric
Event log / ingest	Snowpipe Streaming	Structured Streaming + Auto Loader	Pub/Sub + Storage Write API	Eventstream
Stream processing	Streams & Tasks / Dynamic Tables	Spark Structured Streaming / DLT	Dataflow (streaming)	Eventstream / Spark streaming
Batch processing	Scheduled tasks / dbt	Jobs / dbt	Scheduled queries / dbt	Data Factory pipelines / Notebooks

Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.

The takeaway: don't pick batch or streaming by fashion; pick it by the freshness the decision requires. The underlying patterns (a scheduled job vs. an event log plus a processor) are the same on every platform. What changes between the two is latency, cost, and operational effort.
