
Lesson 8 — Unsupervised Learning: Clustering & PCA

(segmentation, dimensionality reduction, and interpretation for business/economics)

Why this matters (motivation)

In business and economics, you often don’t have a labeled outcome to predict.

Unsupervised learning helps you find structure in the data itself, for example by grouping similar customers into segments or summarizing many correlated variables.


What is unsupervised learning?

Unsupervised learning finds patterns in data without a labeled outcome variable: instead of predicting a target, it organizes the observations or the variables themselves.

Two main tools today

  1. Clustering: group similar observations into clusters (segments)

  2. PCA: reduce many variables into a small number of components


Part A — Clustering for segmentation

1) What clustering does (intuition)

Clustering groups observations based on similarity.

Typical business examples:

2) Scaling matters (a lot)

If one variable has a large numeric scale (e.g., income in dollars), it can dominate distance calculations unless you standardize.
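A minimal sketch of why this matters, assuming scikit-learn and made-up income/age values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: income in dollars, age in years (made-up values)
X = np.array([
    [45_000.0, 25.0],
    [72_000.0, 41.0],
    [38_000.0, 33.0],
    [110_000.0, 52.0],
])

# Unscaled, income differences (tens of thousands) swamp age differences
# (tens) in any Euclidean distance, so clusters would form on income alone.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.round(2))  # each column now has mean 0 and std 1
```

After standardization, both variables contribute on comparable scales to the distance calculation.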

3) Choosing the number of clusters (k)

There is no single “correct” k. Practical methods include the elbow plot (within-cluster sum of squares against k), the silhouette score, and a business argument for why the resulting segments are actionable.
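The elbow and silhouette diagnostics used in the checkpoints below can be computed like this (a sketch on synthetic data; in practice you would use your standardized customer variables):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three planted segments (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

results = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(f"k={k}: inertia={results[k][0]:,.0f}  silhouette={results[k][1]:.3f}")

# Elbow: look for the k where inertia stops dropping sharply.
# Silhouette: prefer the k with the highest average score.
```

Inertia always falls as k grows, which is why it is read for a “bend” rather than minimized outright.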

4) Cluster labeling (the “human step”)

Once you have clusters, you must interpret them:
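A common starting point for this interpretation step is a cluster profile table on the original (unscaled) variables. A sketch with hypothetical variable names and simulated data:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up customer data; column names are illustrative, not from a real dataset
df = pd.DataFrame({
    "income": rng.normal(60_000, 15_000, 200),
    "visits_per_month": rng.poisson(4, 200).astype(float),
    "avg_basket": rng.normal(45, 12, 200),
})

# Cluster on standardized values, but profile on the original scale
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

profile = df.groupby("cluster").mean().round(1)
profile["size"] = df["cluster"].value_counts().sort_index()
print(profile)
```

Reading across a row (for example, “above-average income, frequent visits”) is what turns a numeric cluster ID into a business label.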


Part B — PCA for dimensionality reduction

1) Why PCA exists

When you have many variables, patterns are hard to see: scatterplots show only two variables at a time, and correlated variables repeat much of the same information.

PCA creates new variables (“components”), each a weighted combination of the original variables, ordered so that the first few components summarize as much variation as possible.

2) Interpreting PCA (variance explained + loadings)

Two things matter: how much variance each component explains, and the loadings (the weight of each original variable on each component).

Interpretation idea: look at the variables with the largest absolute loadings on a component and give the component a descriptive name.
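Both quantities can be read directly off a fitted scikit-learn PCA. A sketch with simulated data and made-up variable names:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Simulated variables: income and spending move together; age is independent
income = rng.normal(size=500)
spending = 0.8 * income + 0.6 * rng.normal(size=500)
age = rng.normal(size=500)
X = StandardScaler().fit_transform(np.column_stack([income, spending, age]))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(3))  # share of variance per component
print(pca.components_.round(2))  # loadings: one row per component

# Reading the loadings: a component where income and spending both carry
# large weights of the same sign might be labeled "purchasing power".
```

When all components are kept, the explained-variance shares sum to 1, and the first component captures the shared income–spending variation.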

3) PCA and segmentation together

A practical workflow:

  1. use PCA to reduce dimensions (e.g., 20 variables → 2–5 components)

  2. cluster using the components

  3. interpret clusters back in original variables

This often improves stability and makes plotting easier.
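The three steps above can be sketched end to end (synthetic data with one planted segment; all names and sizes are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(300, 20))   # 20 original variables
X_raw[:150, :5] += 2.0               # plant a segment to recover

Xs = StandardScaler().fit_transform(X_raw)

# Step 1: 20 variables -> 3 components
pca = PCA(n_components=3, random_state=0)
Z = pca.fit_transform(Xs)

# Step 2: cluster in component space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Step 3: interpret clusters back in the original variables
means = pd.DataFrame(X_raw).groupby(labels).mean()
print(means.iloc[:, :5].round(2))
```

Note that step 3 deliberately returns to the original variables: component scores are hard to explain to stakeholders, while means of the raw variables are not.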


Mini case: customer segmentation (typical structure)

Dataset idea: customer-level data with:

Question: “Are there distinct segments, and what should we do differently for each?”

Segmentation outputs should lead to actions:


Visual communication (what you should show)

For clustering:

For PCA:


Mini-lab (Google Colab)

In-class checkpoints (Clustering)

  1. Select 3–6 numeric variables for clustering and justify choice.

  2. Standardize variables (or explain why not).

  3. Run k-means for at least 3 candidate values of k (e.g., k = 2, 3, 4, 5).

  4. Choose k using at least one diagnostic (elbow and/or silhouette) + a business argument.

  5. Produce:

    • cluster sizes

    • cluster summary table (means or medians)

    • one visualization colored by cluster

In-class checkpoints (PCA)

  1. Run PCA on the same variables (standardized).

  2. Report explained variance for the first components.

  3. Interpret Component 1 and Component 2 using loadings (in words).

  4. Plot data in PC1–PC2 space; optionally cluster in PC space.

Submission (after class)


AI check (responsible use for unsupervised learning)

Good prompt examples

Bad prompt example


Review questions (quiz / reflection)

  1. Why is standardization important for k-means clustering?

  2. Why is there no single “correct” number of clusters?

  3. What does “explained variance” mean in PCA?

  4. What is one responsible caveat you should include when presenting clusters?