Part 9 — Partitioning & Clustering

Learn the Pattern · Part 9

Partitioning & Clustering

The cheapest, highest-leverage way to make queries fast and bills small — and it comes down to one idea: make the engine read less.

In 60 seconds

Imagine a library with no shelves — every book in one giant pile. Finding one means reading them all. That's an unpartitioned table.

Partitioning — split data into folders by a column (usually date).
Result — query "last 7 days" and the engine skips ~99% of the data. Less scanned = faster AND cheaper.
Clustering / Z-ordering — sort data inside each partition.
Result — co-locates rows you filter together (e.g., by customer_id), so even less data is read.
The mental model — engines are fast because they READ LESS. Partition by what you filter on most.

Speed Comes from Reading Less

Every cloud query engine has the same fundamental cost: how much data it has to scan. Pricing on platforms like BigQuery is literally bytes-scanned; on warehouse compute it's time-on-engine, which is also driven by data read. So the highest-leverage optimization is almost never "a faster engine" — it's arranging the data so the engine can skip most of it.

Two techniques do this, and they stack. Partitioning physically separates rows into segments by a column. Clustering (a.k.a. Z-ordering) sorts the rows within those segments. Together they let the engine prune away everything irrelevant to your filter before it reads a single byte of it.

The Two Levers

1. Partitioning — Divide by Your Most Common Filter

Split the table into segments based on a column you filter on constantly — almost always a date. Now a query for "last 7 days" touches 7 partitions instead of years of history. This is called partition pruning, and it's often the difference between a query scanning 10 GB and 10 TB. Rule of thumb: partition by time.

2. Clustering / Z-Ordering — Sort Within Partitions

Inside each partition, sort the data by your next most common filter (customer_id, region, product). When rows you query together are physically stored together, the engine reads fewer files/blocks to find them. Clustering refines what partitioning starts.

3. The Failure Mode: Over-Partitioning

More partitions is not better. Partition by a high-cardinality column (like user_id) and you get millions of tiny files — the "small files problem" — which actually slows queries and bloats metadata. Partition by something low-cardinality and coarse (day, sometimes month); cluster by the high-cardinality stuff. Getting this balance right is a core data-engineering skill.

Partition Pruning, Visualized

Partitioning skips whole folders; clustering trims what's read inside the surviving one.

Same Pattern, Every Platform

The Lever	Snowflake	Databricks (Delta)	BigQuery	Microsoft Fabric
Partitioning	Automatic micro-partitions	Partition columns / Liquid Clustering	Partitioned tables (by date/range)	Delta partitions (Spark/SQL)
Clustering	Clustering keys	Z-ORDER / Liquid Clustering	Clustering columns	Z-ORDER (Delta)
Goal	Read less → faster & cheaper			Same

Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.

The takeaway: every platform gives you the same two levers — partition by your coarse, frequent filter (time); cluster by your fine one (id, region). The names differ (clustering keys, Z-order, Liquid Clustering), but the goal is identical: make the engine read less. It's usually the #1 cost fix on a data team.

← Part 8 Open Table Formats Part 10 → Data Quality & Testing

← Back to Publications