Partitioning & Clustering
The cheapest, highest-leverage way to make queries fast and bills small — and it comes down to one idea: make the engine read less.
Imagine a library with no shelves — every book in one giant pile. Finding one means reading them all. That's an unpartitioned table.
- Partitioning — split data into folders by a column (usually date).
- Result — query "last 7 days" and the engine skips ~99% of the data. Less scanned = faster AND cheaper.
- Clustering / Z-ordering — sort data inside each partition.
- Result — co-locates rows you filter together (e.g., by customer_id), so even less data is read.
- The mental model — engines are fast because they READ LESS. Partition by what you filter on most.
Speed Comes from Reading Less
Every cloud query engine has the same fundamental cost: how much data it has to scan. Pricing on platforms like BigQuery is literally bytes-scanned; on warehouse compute it's time-on-engine, which is also driven by data read. So the highest-leverage optimization is almost never "a faster engine" — it's arranging the data so the engine can skip most of it.
Two techniques do this, and they stack. Partitioning physically separates rows into segments by a column. Clustering (a.k.a. Z-ordering) sorts the rows within those segments. Together they let the engine prune away everything irrelevant to your filter before it reads a single byte of it.
The Two Levers
1. Partitioning — Divide by Your Most Common Filter
Split the table into segments based on a column you filter on constantly — almost always a date. Now a query for "last 7 days" touches 7 partitions instead of years of history. This is called partition pruning, and it's often the difference between a query scanning 10 GB and 10 TB. Rule of thumb: partition by time.
2. Clustering / Z-Ordering — Sort Within Partitions
Inside each partition, sort the data by your next most common filter (customer_id, region, product). When rows you query together are physically stored together, the engine reads fewer files/blocks to find them. Clustering refines what partitioning starts.
3. The Failure Mode: Over-Partitioning
More partitions is not better. Partition by a high-cardinality column (like user_id) and you get millions of tiny files — the "small files problem" — which actually slows queries and bloats metadata. Partition by something low-cardinality and coarse (day, sometimes month); cluster by the high-cardinality stuff. Getting this balance right is a core data-engineering skill.
Partition Pruning, Visualized
Partitioning skips whole folders; clustering trims what's read inside the surviving one.
Same Pattern, Every Platform
| The Lever | Snowflake | Databricks (Delta) | BigQuery | Microsoft Fabric |
|---|---|---|---|---|
| Partitioning | Automatic micro-partitions | Partition columns / Liquid Clustering | Partitioned tables (by date/range) | Delta partitions (Spark/SQL) |
| Clustering | Clustering keys | Z-ORDER / Liquid Clustering | Clustering columns | Z-ORDER (Delta) |
| Goal | Read less → faster & cheaper | Same | ||
Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.
The takeaway: every platform gives you the same two levers — partition by your coarse, frequent filter (time); cluster by your fine one (id, region). The names differ (clustering keys, Z-order, Liquid Clustering), but the goal is identical: make the engine read less. It's usually the #1 cost fix on a data team.