Segmentation & Clustering

Finding the Groups Hidden in the Data

A single average hides more than it reveals. The "average customer" rarely exists — what exists are distinct groups that behave differently. Segmentation splits the data into meaningful groups using rules you define (region, customer type, spend tier). Clustering goes further: it lets an algorithm discover natural groupings you did not pre-define. Both turn one blurry population into several sharp, comparable cohorts.

Segmentation and clustering split one population into cohorts that can finally be compared like-for-like.

Segmentation vs Clustering

Segmentation (rule-based): you decide the boundaries — region, segment, or spend tier via quantiles. Transparent and easy to explain to stakeholders.
Clustering (algorithmic): an unsupervised algorithm such as K-Means groups rows by similarity across several features at once, surfacing patterns no single rule would catch.

How K-Means Works (in one paragraph)

K-Means picks k centers, assigns each row to its nearest center, recomputes the centers as the mean of their members, and repeats until the assignments stop moving. Features must be scaled first (so that "sales in dollars" doesn't dominate "quantity in units"), and you choose k with the elbow method or a silhouette score.

In Python — rule-based tiers and K-Means

Python · pandas + scikit-learn

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Build a customer-level table from the order lines
cust = (orders.groupby("customer_id")
              .agg(total_sales=("sales", "sum"),
                   orders=("order_id", "nunique"),
                   avg_discount=("discount", "mean"))
              .reset_index())

# Rule-based segmentation: spend tiers via quantiles
cust["tier"] = pd.qcut(cust["total_sales"], q=3,
                       labels=["Low", "Mid", "High"])

# Algorithmic clustering: scale, then K-Means on 3 features
X = StandardScaler().fit_transform(
        cust[["total_sales", "orders", "avg_discount"]])
cust["cluster"] = KMeans(n_clusters=3, n_init=10,
                         random_state=42).fit_predict(X)

print(cust.groupby("cluster")[["total_sales", "orders", "avg_discount"]].mean())

In SQL — quantile tiers with NTILE

SQL

-- Rule-based segmentation is natural in SQL with NTILE / CASE.
-- (True clustering like K-Means lives in Python or a warehouse ML add-on.)
SELECT
  customer_id,
  SUM(sales) AS total_sales,
  COUNT(DISTINCT order_id) AS orders,
  NTILE(3) OVER (ORDER BY SUM(sales)) AS spend_tier,   -- 1=low, 3=high
  CASE WHEN AVG(discount) > 0.20 THEN 'Discount-driven'
       ELSE 'Full-price' END AS price_behavior
FROM orders
GROUP BY customer_id;

Tools Commonly Used

Python: Pandas qcut/cut, scikit-learn KMeans, DBSCAN, StandardScaler.
SQL: NTILE, CASE, and GROUP BY for rule-based tiers.
Power BI / Tableau: grouping, binning, and built-in clustering on a scatter plot.

Best Practices

Scale features before clustering so large-magnitude columns don't dominate the distance.
Choose k deliberately (elbow / silhouette), not by guessing.
Give clusters business names ("Discount-driven", "Loyal high-value") — a numbered cluster means nothing to a stakeholder.
Validate that segments are actually different — if two groups behave identically, you have too many.

Example: Retail Customer Segments

In the retail dataset, segmentation and clustering might produce:

Three spend tiers (Low / Mid / High) from a simple quantile rule for reporting.
A K-Means model that surfaces a "discount-driven, thin-margin" cluster hidden inside the Mid tier.
Region- and segment-level cohorts that feed the comparisons and visuals in the next steps.

← Back to Descriptive Analytics