Lesson 8 — Unsupervised Learning: Clustering & PCA
(segmentation, dimensionality reduction, and interpretation for business/economics)
Why this matters (motivation)
In business and economics, you often don’t have a labeled outcome:
You may not know who will churn yet,
You may not have a single measure of “customer value,”
You might want to understand patterns in survey responses, behavior, or firm characteristics.
Unsupervised learning helps you:
discover structure (segments, clusters),
compress information (PCA),
and generate hypotheses for later modeling or strategy.
What is unsupervised learning?
Unsupervised learning looks for patterns in the features themselves, with no labeled outcome to predict.
Two main tools today
Clustering: group similar observations into clusters (segments)
PCA: reduce many variables into a small number of components
Part A — Clustering for segmentation
1) What clustering does (intuition)
Clustering groups observations based on similarity.
Typical business examples:
customer segmentation (spend, frequency, categories)
store segmentation (foot traffic, product mix)
firm clustering (size, leverage, growth, productivity)
2) Scaling matters (a lot)
If one variable has a large numeric scale (e.g., income in dollars), it can dominate distance calculations unless you standardize.
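A minimal sketch of this effect, using made-up numbers (income in dollars vs. visits per month) and scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [income in dollars, visits per month]
X = np.array([[40_000.0,  2.0],
              [41_000.0, 30.0],   # similar income, very different behavior
              [90_000.0,  2.0]])  # very different income, same behavior

# Raw Euclidean distances from customer 0: income dominates completely
d_raw = [np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2])]

# After standardization each column has mean 0 and std 1,
# so both variables contribute on a comparable scale
Z = StandardScaler().fit_transform(X)
d_std = [np.linalg.norm(Z[0] - Z[1]), np.linalg.norm(Z[0] - Z[2])]

print([round(d, 1) for d in d_raw])  # customer 2 looks ~50x farther: income swamps visits
print([round(d, 1) for d in d_std])  # after scaling, the two distances are comparable
```

On the raw data, the 28-visit behavioral difference is invisible next to a $50,000 income difference; after standardizing, both differences count.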
3) Choosing the number of clusters (k)
There is no single “correct” k. Practical methods:
elbow method (inertia vs k)
silhouette score (separation/compactness)
interpretability and usefulness (business reality check)
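The elbow and silhouette diagnostics can be sketched as follows (synthetic `make_blobs` data stands in for standardized customer features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for standardized customer features (3 true groups)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={inertias[k]:.0f}, silhouette={silhouettes[k]:.2f}")

# The "elbow" is where inertia stops dropping sharply as k grows;
# a higher silhouette (closer to 1) means tighter, better-separated clusters.
```

Note that inertia always decreases as k grows, which is exactly why it cannot pick k by itself; you look for the bend, check the silhouette, and then apply the business reality check.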
4) Cluster labeling (the “human step”)
Once you have clusters, you must interpret them:
summarize each cluster (mean/median of key variables)
compare sizes (how many customers in each)
give a simple label (e.g., “high spend / low frequency”)
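A sketch of that interpretation step with pandas, on a hypothetical two-group customer table (column names and numbers are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table with two distinct behavioral groups baked in
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spend":  np.r_[rng.normal(500, 50, 50), rng.normal(100, 20, 50)],
    "visits": np.r_[rng.normal(2, 0.5, 50), rng.normal(12, 2, 50)],
})

# Cluster on standardized features...
Z = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# ...but summarize in the ORIGINAL units so clusters are easy to label
summary = df.groupby("cluster").agg(size=("spend", "size"),
                                    spend=("spend", "median"),
                                    visits=("visits", "median"))
print(summary)
```

Here one row of `summary` would plausibly read as “high spend / low frequency” and the other as “low spend / high frequency”; the label comes from you, not from the algorithm.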
Part B — PCA for dimensionality reduction
1) Why PCA exists
When you have many variables, patterns are hard to see:
many survey questions,
many product categories,
many macro indicators.
PCA creates new variables (“components”) that summarize variation.
2) Interpreting PCA (variance explained + loadings)
Two things matter:
explained variance ratio: how much information each component captures
loadings: which original variables contribute strongly to each component
Interpretation idea:
“Component 1 looks like an overall ‘economic development’ dimension”
“Component 2 looks like ‘urbanization vs agriculture’” (These labels are interpretive; you must justify them using loadings.)
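A sketch of where those two quantities live in scikit-learn, using simulated country indicators (the variable names and the data-generating story are invented to make the loadings interpretable):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
# Simulated indicators: GDP per capita and schooling share one latent
# "development" factor; urban share varies independently
development = rng.normal(size=n)
urban = rng.normal(size=n)
X = np.column_stack([
    development + 0.1 * rng.normal(size=n),  # col 0: gdp_pc
    development + 0.1 * rng.normal(size=n),  # col 1: schooling
    urban       + 0.1 * rng.normal(size=n),  # col 2: urban_share
])

pca = PCA().fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_.round(2))  # share of variance per component
print(pca.components_.round(2))                # loadings: rows = components, cols = variables

# PC1 loads on gdp_pc and schooling together -> a "development" dimension;
# PC2 is dominated by urban_share. (Signs are arbitrary: a component and
# its negative describe the same direction.)
```

This is the justification step the labels above require: you point at the loading pattern, not at intuition alone.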
3) PCA and segmentation together
A practical workflow:
use PCA to reduce dimensions (e.g., 20 variables → 2–5 components)
cluster using the components
interpret clusters back in original variables
This often improves stability and makes plotting easier.
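The workflow above can be written as a single scikit-learn pipeline (again on synthetic stand-in data; in the mini-lab you would swap in your own feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 observations, 10 features, 3 true groups
X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=1)

# Workflow: standardize -> reduce 10 features to 2 components -> cluster
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=1),
)
labels = pipe.fit_predict(X)

# The 2-D PC scores double as plotting coordinates for the cluster map
scores = pipe[:-1].transform(X)
print(scores.shape, np.bincount(labels))
```

For the last step of the workflow (interpreting clusters back in original variables), you would group the original columns of `X` by `labels`, exactly as in the cluster-summary table above.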
Mini case: customer segmentation (typical structure)
Dataset idea: customer-level data with:
spending (total, categories),
frequency,
recency,
discounts used,
returns,
basic demographics (if available and appropriate)
Question: “Are there distinct segments, and what should we do differently for each?”
Segmentation outputs should lead to actions:
targeted marketing,
differentiated product bundles,
pricing/discount strategy,
service prioritization (but watch fairness concerns).
Visual communication (what you should show)
For clustering:
scatter plot of two features (or PC1 vs PC2) colored by cluster
bar chart of cluster sizes
table of cluster summaries (means/medians)
For PCA:
explained variance plot
loading table for first 2–3 components
scatter of observations in PC space
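Two of the clustering visuals above (the colored scatter and the cluster-size bars) can be sketched with matplotlib; synthetic data again stands in for your features or PC scores:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in Colab
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sizes = np.bincount(labels)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(X[:, 0], X[:, 1], c=labels, s=15)  # two features, or PC1 vs PC2
ax1.set(title="Observations colored by cluster",
        xlabel="feature 1", ylabel="feature 2")
ax2.bar(range(len(sizes)), sizes)              # how many customers per segment
ax2.set(title="Cluster sizes", xlabel="cluster", ylabel="count")
fig.tight_layout()
fig.savefig("cluster_report.png")
```

The summary table (means/medians per cluster) is usually shown as a plain table rather than a plot, as in the labeling sketch earlier.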
Mini-lab (Google Colab)
In-class checkpoints (Clustering)
Select 3–6 numeric variables for clustering and justify choice.
Standardize variables (or explain why not).
Run k-means for at least 3 candidate values of k (e.g., k = 2, 3, 4, 5).
Choose k using at least one diagnostic (elbow and/or silhouette) + a business argument.
Produce:
cluster sizes
cluster summary table (means or medians)
one visualization colored by cluster
In-class checkpoints (PCA)
Run PCA on the same variables (standardized).
Report the explained variance for the first few components.
Interpret Component 1 and Component 2 using loadings (in words).
Plot data in PC1–PC2 space; optionally cluster in PC space.
Submission (after class)
Colab link (view permission) or PDF export.
Include:
a short segmentation narrative (headline + evidence + action),
and one limitation/caveat.
AI check (responsible use for unsupervised learning)
Good prompt examples
“Given this cluster summary table, suggest neutral labels and 1–2 business actions per cluster.”
“Given these PCA loadings, propose an interpretation for PC1 and PC2 and explain your reasoning.”
“What checks can I run to test whether my clusters are stable?”
Bad prompt example
“Tell me what these clusters ‘really mean’ about customers” (overclaiming)
Review questions (quiz / reflection)
Why is standardization important for k-means clustering?
Why is there no single “correct” number of clusters?
What does “explained variance” mean in PCA?
What is one responsible caveat you should include when presenting clusters?