Segmentation & Clustering
Finding the Groups Hidden in the Data
A single average hides more than it reveals. The "average customer" rarely exists — what exists are distinct groups that behave differently. Segmentation splits the data into meaningful groups using rules you define (region, customer type, spend tier). Clustering goes further: it lets an algorithm discover natural groupings you did not pre-define. Both turn one blurry population into several sharp, comparable cohorts.
Segmentation and clustering split one population into cohorts that can finally be compared like-for-like.
Segmentation vs Clustering
- Segmentation (rule-based): you decide the boundaries — region, segment, or spend tier via quantiles. Transparent and easy to explain to stakeholders.
- Clustering (algorithmic): an unsupervised algorithm such as K-Means groups rows by similarity across several features at once, surfacing patterns no single rule would catch.
How K-Means Works (in one paragraph)
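K-Means picks k initial centers, assigns each row to its nearest center, recomputes each
center as the mean of its members, and repeats until the assignments stop changing. Because
"nearest" is measured by distance, features must be scaled first (so that "sales in dollars"
doesn't dominate "quantity in units"). You choose k with the elbow method (the point where
adding clusters stops reducing within-cluster spread by much) or a silhouette score (how well
each row fits its own cluster compared with the nearest other one).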
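The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name kmeans, the random-row initialization, and the toy two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means loop on an already-scaled 2-D array X (illustrative only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Pick k distinct rows as the starting centers (fancy indexing returns a copy)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assign each row to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each center as the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster ends up empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data (hypothetical): two groups whose columns live on very different scales
rng = np.random.default_rng(1)
raw = np.vstack([
    rng.normal([100.0, 1.0], [10.0, 0.1], size=(50, 2)),
    rng.normal([500.0, 5.0], [10.0, 0.1], size=(50, 2)),
])

# Standardize each column so neither one dominates the distance
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

labels, centers = kmeans(X, k=2)
print(np.bincount(labels, minlength=2))  # rows per cluster
print(centers)                           # centers in standardized units
```

In practice, a library implementation such as scikit-learn's KMeans is preferable because it adds smarter initialization and multiple restarts; the sketch only makes the assign-and-recompute cycle visible.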
In Python — rule-based tiers and K-Means
Python · pandas + scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Build a customer-level table from the order lines
cust = (orders.groupby("customer_id")
              .agg(total_sales=("sales", "sum"),
                   orders=("order_id", "nunique"),
                   avg_discount=("discount", "mean"))
              .reset_index())

# Rule-based segmentation: spend tiers via quantiles
cust["tier"] = pd.qcut(cust["total_sales"], q=3,
                       labels=["Low", "Mid", "High"])

# Algorithmic clustering: scale, then K-Means on 3 features
X = StandardScaler().fit_transform(
    cust[["total_sales", "orders", "avg_discount"]])
cust["cluster"] = KMeans(n_clusters=3, n_init=10,
                         random_state=42).fit_predict(X)

print(cust.groupby("cluster")[["total_sales", "orders", "avg_discount"]].mean())
In SQL — quantile tiers with NTILE
SQL
-- Rule-based segmentation is natural in SQL with NTILE / CASE.
-- (True clustering like K-Means lives in Python or a warehouse ML add-on.)
SELECT
    customer_id,
    SUM(sales)                          AS total_sales,
    COUNT(DISTINCT order_id)            AS orders,
    NTILE(3) OVER (ORDER BY SUM(sales)) AS spend_tier,  -- 1=low, 3=high
    CASE WHEN AVG(discount) > 0.20 THEN 'Discount-driven'
         ELSE 'Full-price' END          AS price_behavior
FROM orders
GROUP BY customer_id;
Tools Commonly Used
- Python: pandas qcut / cut; scikit-learn KMeans, DBSCAN, StandardScaler.
- SQL: NTILE, CASE, and GROUP BY for rule-based tiers.
- Power BI / Tableau: grouping, binning, and built-in clustering on a scatter plot.
Best Practices
- Scale features before clustering so large-magnitude columns don't dominate the distance.
- Choose k deliberately (elbow / silhouette), not by guessing.
- Give clusters business names ("Discount-driven", "Loyal high-value"); a numbered cluster means nothing to a stakeholder.
- Validate that segments are actually different — if two groups behave identically, you have too many.
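One way to choose k deliberately is to fit K-Means for a range of k values and compare inertia (for the elbow) with the silhouette score. The sketch below assumes synthetic data from make_blobs with three hand-picked centers as a stand-in for a real customer feature table; the range of k values is likewise an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features (hypothetical data, three clear groups)
features, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)
X = StandardScaler().fit_transform(features)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        # Within-cluster sum of squares: plot against k and look for the bend
        "inertia": model.inertia_,
        # Mean silhouette in [-1, 1]: higher means tighter, better-separated clusters
        "silhouette": silhouette_score(X, model.labels_),
    }

for k, r in results.items():
    print(f"k={k}  inertia={r['inertia']:.1f}  silhouette={r['silhouette']:.3f}")

# Pick the k with the highest silhouette as a starting point, then sanity-check it
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("best k by silhouette:", best_k)
```

On real customer data the silhouette peak is usually much less clear-cut than on synthetic blobs, so the chosen k is best treated as a candidate to review against the cluster profiles rather than a final answer.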
Example: Retail Customer Segments
In the retail dataset, segmentation and clustering might produce:
- Three spend tiers (Low / Mid / High) from a simple quantile rule for reporting.
- A K-Means model that surfaces a "discount-driven, thin-margin" cluster hidden inside the Mid tier.
- Region- and segment-level cohorts that feed the comparisons and visuals in the next steps.
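To check whether a cluster really sits inside one spend tier, a cross-tabulation of tier against cluster is a simple starting point. The sketch below uses a randomly generated customer table as a placeholder (so it will not show a meaningful pattern); the column names follow the earlier Python example, and the naming rule for the "Discount-driven" cluster is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table; in practice use the cust frame built from the orders
rng = np.random.default_rng(0)
n = 300
cust = pd.DataFrame({
    "total_sales": rng.gamma(shape=2.0, scale=500.0, size=n),
    "orders": rng.integers(1, 20, size=n),
    "avg_discount": rng.uniform(0.0, 0.4, size=n),
})
features = ["total_sales", "orders", "avg_discount"]

# Rule-based tiers and algorithmic clusters, as in the earlier example
cust["tier"] = pd.qcut(cust["total_sales"], q=3, labels=["Low", "Mid", "High"])
X = StandardScaler().fit_transform(cust[features])
cust["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# How do the tiers and the clusters overlap? Rows are tiers, columns are clusters
overlap = pd.crosstab(cust["tier"], cust["cluster"])
print(overlap)

# Profile each cluster, then give the highest-discount one a business name
profile = cust.groupby("cluster")[features].mean()
discount_cluster = profile["avg_discount"].idxmax()
names = {c: f"Cluster {c} (unnamed)" for c in profile.index}
names[discount_cluster] = "Discount-driven"
cust["cluster_name"] = cust["cluster"].map(names)
print(profile.rename(index=names))
```

A cluster whose rows concentrate in a single tier row of the cross-tab is the kind of hidden sub-group described above; the remaining placeholder names would be replaced after reviewing each cluster's profile.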