← Back to Series Overview
Learn the Pattern · Part 9

Partitioning & Clustering

The cheapest, highest-leverage way to make queries fast and bills small — and it comes down to one idea: make the engine read less.

In 60 seconds

Imagine a library with no shelves — every book in one giant pile. Finding one means reading them all. That's an unpartitioned table.

  1. Partitioning — split data into folders by a column (usually date).
  2. Result — query "last 7 days" and the engine skips ~99% of the data. Less scanned = faster AND cheaper.
  3. Clustering / Z-ordering — sort data inside each partition.
  4. Result — co-locates rows you filter together (e.g., by customer_id), so even less data is read.
  5. The mental model — engines are fast because they READ LESS. Partition by what you filter on most.

Speed Comes from Reading Less

Every cloud query engine has the same fundamental cost: how much data it has to scan. Pricing on platforms like BigQuery is literally bytes-scanned; on warehouse compute it's time-on-engine, which is also driven by data read. So the highest-leverage optimization is almost never "a faster engine" — it's arranging the data so the engine can skip most of it.

Two techniques do this, and they stack. Partitioning physically separates rows into segments by a column. Clustering (a.k.a. Z-ordering) sorts the rows within those segments. Together they let the engine prune away everything irrelevant to your filter before it reads a single byte of it.

The Two Levers

1. Partitioning — Divide by Your Most Common Filter

Split the table into segments based on a column you filter on constantly — almost always a date. Now a query for "last 7 days" touches 7 partitions instead of years of history. This is called partition pruning, and it's often the difference between a query scanning 10 GB and 10 TB. Rule of thumb: partition by time.

2. Clustering / Z-Ordering — Sort Within Partitions

Inside each partition, sort the data by your next most common filter (customer_id, region, product). When rows you query together are physically stored together, the engine reads fewer files/blocks to find them. Clustering refines what partitioning starts.

3. The Failure Mode: Over-Partitioning

More partitions is not better. Partition by a high-cardinality column (like user_id) and you get millions of tiny files — the "small files problem" — which actually slows queries and bloats metadata. Partition by something low-cardinality and coarse (day, sometimes month); cluster by the high-cardinality stuff. Getting this balance right is a core data-engineering skill.

Partition Pruning, Visualized

Skip the partitions you don't need — read less 🔎 Query WHERE date = '2026-06-10' AND customer_id = 42 Partitioned + clustered table 📁 2026-06-08 — skipped 📁 2026-06-09 — skipped 📁 2026-06-10 sorted by customer_id ✅ tiny read fast & cheap partition pruning clustering

Partitioning skips whole folders; clustering trims what's read inside the surviving one.

Same Pattern, Every Platform

The LeverSnowflakeDatabricks (Delta)BigQueryMicrosoft Fabric
PartitioningAutomatic micro-partitionsPartition columns / Liquid ClusteringPartitioned tables (by date/range)Delta partitions (Spark/SQL)
ClusteringClustering keysZ-ORDER / Liquid ClusteringClustering columnsZ-ORDER (Delta)
GoalRead less → faster & cheaperSame

Feature names evolve — treat this as a capability map, and confirm specifics against current vendor docs.

The takeaway: every platform gives you the same two levers — partition by your coarse, frequent filter (time); cluster by your fine one (id, region). The names differ (clustering keys, Z-order, Liquid Clustering), but the goal is identical: make the engine read less. It's usually the #1 cost fix on a data team.

← Back to Publications