← Back to Descriptive Analytics

Segmentation & Clustering

Finding the Groups Hidden in the Data

A single average hides more than it reveals. The "average customer" rarely exists — what exists are distinct groups that behave differently. Segmentation splits the data into meaningful groups using rules you define (region, customer type, spend tier). Clustering goes further: it lets an algorithm discover natural groupings you did not pre-define. Both turn one blurry population into several sharp, comparable cohorts.

One population, several natural groups All customers one blurry average High value low discount Mid value steady Discount-driven thin margin

Segmentation and clustering split one population into cohorts that can finally be compared like-for-like.

Segmentation vs Clustering

How K-Means Works (in one paragraph)

K-Means picks k centers, assigns each row to its nearest center, recomputes the centers as the mean of their members, and repeats until the assignments stop moving. Features must be scaled first (so that "sales in dollars" doesn't dominate "quantity in units"), and you choose k with the elbow method or a silhouette score.

In Python — rule-based tiers and K-Means

Python · pandas + scikit-learn

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Build a customer-level table from the order lines
cust = (orders.groupby("customer_id")
              .agg(total_sales=("sales", "sum"),
                   orders=("order_id", "nunique"),
                   avg_discount=("discount", "mean"))
              .reset_index())

# Rule-based segmentation: spend tiers via quantiles
cust["tier"] = pd.qcut(cust["total_sales"], q=3,
                       labels=["Low", "Mid", "High"])

# Algorithmic clustering: scale, then K-Means on 3 features
X = StandardScaler().fit_transform(
        cust[["total_sales", "orders", "avg_discount"]])
cust["cluster"] = KMeans(n_clusters=3, n_init=10,
                         random_state=42).fit_predict(X)

print(cust.groupby("cluster")[["total_sales", "orders", "avg_discount"]].mean())

In SQL — quantile tiers with NTILE

SQL

-- Rule-based segmentation is natural in SQL with NTILE / CASE.
-- (True clustering like K-Means lives in Python or a warehouse ML add-on.)
SELECT
  customer_id,
  SUM(sales) AS total_sales,
  COUNT(DISTINCT order_id) AS orders,
  NTILE(3) OVER (ORDER BY SUM(sales)) AS spend_tier,   -- 1=low, 3=high
  CASE WHEN AVG(discount) > 0.20 THEN 'Discount-driven'
       ELSE 'Full-price' END AS price_behavior
FROM orders
GROUP BY customer_id;

Tools Commonly Used

Best Practices

Example: Retail Customer Segments

In the retail dataset, segmentation and clustering might produce:

← Back to Descriptive Analytics