Lesson 6 — Correlation, Simple & Multiple Regression (interpretation for business/economics)

Why this matters

Regression is one of the most widely used tools in business and economics because it helps quantify relationships:

Do countries with higher income tend to have higher (or lower) divorce outcomes?
Are higher prices associated with lower demand?
Which factors are correlated with churn or customer satisfaction?

But regression is easy to misuse, especially when we confuse association with causation.

Where regression sits in the AI/ML/DS map

The regression mindset: from question to model

Today’s running example (new dataset)

We use:

divorce-raw-1992.csv — divorce data
Unit: country (2019 cross-section)
Outcome $Y$: divorce–marriage ratio
- d_rate = divorce_rate / marriage_rate
Main predictor $X$: income (rescaled GDP per capita)
- income_pc = gdp_pc / 1000
Control candidate: GDP growth (gdp_gr)

Step 0: Start with a picture (scatter plot)

Before any equation, draw the relationship.

If the scatter is basically a shapeless cloud: the linear relationship is weak (a curved pattern would instead suggest a non-linear relationship).
If there is a visible upward/downward pattern: regression helps summarize it.

Step 1: Correlation (a warm-up)

Correlation is a standardized measure of linear association.

It ranges from -1 to +1.
It is symmetric: $\mathrm{corr}(X,Y)=\mathrm{corr}(Y,X)$.
It does not tell you “how much $Y$ changes when $X$ changes” in units.
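The three properties above can be checked on a small example. This is a minimal sketch using NumPy; the numbers are made-up toy values (not the lesson's dataset), and the helper name `corr` is hypothetical.

```python
import numpy as np

# Hypothetical toy data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def corr(a, b):
    """Pearson correlation: co-movement divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c**2).sum() * (b_c**2).sum()))


r_xy = corr(x, y)
r_yx = corr(y, x)               # symmetry: swapping the variables changes nothing
r_rescaled = corr(x * 1000, y)  # unit-free: rescaling x leaves r unchanged

print(round(r_xy, 4), round(r_yx, 4), round(r_rescaled, 4))
```

Because the correlation is unchanged when `x` is multiplied by 1000, it cannot carry information about "units of $Y$ per unit of $X$"; that is the job of the regression slope.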

Step 2: Simple regression (one predictor)

A simple regression estimates a line:

$Y = \beta_0 + \beta_1 X + \varepsilon$

$\beta_0$ is the intercept: the average value of $Y$ when $X = 0$.
$\beta_1$ is the slope: the average change in $Y$ associated with a one-unit increase in $X$.
$\varepsilon$ captures everything else not in the model.
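For reference, the least-squares estimates of this line have a closed form. These are the standard OLS results; the hats (estimates), the sample means $\bar X$, $\bar Y$, and the sample standard deviations $s_X$, $s_Y$ are notation introduced here, not used elsewhere in the lesson.

$\hat\beta_1 = \dfrac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} = \mathrm{corr}(X,Y)\,\dfrac{s_Y}{s_X}$

$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$

The second form of $\hat\beta_1$ shows the link to Step 1: the slope always has the same sign as the correlation, but it is rescaled into units of $Y$ per unit of $X$.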

Our example (simple regression)

$Y$: d_rate
$X$: income_pc

Interpretation template:

“A one-unit increase in income_pc (i.e., $1{,}000$ dollars of GDP per capita) is associated with a $\beta_1$ change in d_rate, on average, in this sample.”
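To see why the template spells out what "one unit" means, the sketch below fits the same line twice, once with GDP per capita in dollars and once in thousands of dollars. The values are hypothetical (not the lesson's data), and `np.polyfit` stands in for whatever regression tool is used in class.

```python
import numpy as np

# Hypothetical values, for illustration only.
gdp_pc = np.array([5000.0, 12000.0, 20000.0, 35000.0, 50000.0])  # dollars
d_rate = np.array([0.20, 0.28, 0.35, 0.44, 0.55])  # unscaled divorce-marriage ratio

income_pc = gdp_pc / 1000  # thousands of dollars, as in the running example

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope_dollars, intercept_dollars = np.polyfit(gdp_pc, d_rate, deg=1)
slope_thousands, intercept_thousands = np.polyfit(income_pc, d_rate, deg=1)

print(f"change in d_rate per $1 of GDP per capita:     {slope_dollars:.6f}")
print(f"change in d_rate per $1,000 of GDP per capita: {slope_thousands:.6f}")
```

The fitted line is the same in both cases; only the slope's scale changes (by a factor of 1,000), which is why the interpretation sentence has to name the unit.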

Step 3: Multiple regression (adding a control)

Multiple regression includes more than one predictor:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

Now $\beta_1$ is interpreted as the association between $Y$ and $X_1$ holding other included variables constant.
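A small simulation can make "holding other included variables constant" concrete. This is a sketch on simulated data with coefficients chosen for illustration (true $\beta_1 = 2$, $\beta_2 = -3$); the helper `ols` is a hypothetical name wrapping NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data with known coefficients (purely illustrative).
x2 = rng.normal(size=n)             # the control variable
x1 = 0.8 * x2 + rng.normal(size=n)  # main predictor, correlated with x2
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)


def ols(y, *predictors):
    """Return least-squares coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_simple = ols(y, x1)        # y ~ x1
b_multiple = ols(y, x1, x2)  # y ~ x1 + x2

print("slope on x1, simple regression:  ", round(b_simple[1], 3))
print("slope on x1, multiple regression:", round(b_multiple[1], 3))
```

In this setup the simple-regression slope on `x1` lands well below 2 because it also absorbs part of the negative effect of `x2`; once `x2` is included, the slope on `x1` moves back close to the value used to generate the data.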

Why add controls?

To reduce confounding (informally):

If both income level and macroeconomic conditions are related to divorce/marriage patterns, a simple regression may mix these relationships.

Today we add:

gdp_gr (GDP growth) as a simple control.
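The effect of adding a control can be written down exactly. This is the standard omitted-variable algebra for least squares; the tilde, hats, and $\hat\delta_1$ are notation introduced here for illustration.

$\tilde\beta_1 = \hat\beta_1 + \hat\beta_2\,\hat\delta_1$

Here $\tilde\beta_1$ is the slope from the simple regression of $Y$ on $X_1$, $\hat\beta_1$ and $\hat\beta_2$ are the slopes from the multiple regression of $Y$ on $X_1$ and $X_2$, and $\hat\delta_1$ is the slope from regressing $X_2$ on $X_1$. In today's example $X_1$ is income_pc and $X_2$ is gdp_gr. The coefficient on income_pc therefore changes when gdp_gr is added only if gdp_gr is related both to d_rate ($\hat\beta_2 \neq 0$) and to income_pc ($\hat\delta_1 \neq 0$).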

How to read regression output (what we focus on)

Coefficients

sign (positive/negative association)
magnitude (how much change in $Y$ per unit of $X$)
units matter

Uncertainty (standard errors / confidence intervals)

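How precise is the estimate? A wide confidence interval, especially one that includes zero, means the data are also consistent with a much smaller association or none at all.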

Fit (R-squared)

How much of the variation in $Y$ is explained (in-sample)?
Do not treat $R^2$ as “truth” or “usefulness” by itself.
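The three pieces of output listed above can be computed by hand, which helps demystify a regression table. The sketch below uses simulated data and the textbook OLS formulas under the usual homoskedastic-error assumption; the 1.96 multiplier is the normal approximation for a 95% interval, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated data (illustrative only): true intercept 0.5, true slope 1.5.
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

# Coefficients: least-squares fit of y on a constant and x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Uncertainty: classical standard errors from the residual variance.
resid = y - X @ beta
dof = n - X.shape[1]           # observations minus estimated coefficients
sigma2 = resid @ resid / dof   # estimated error variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Approximate 95% confidence intervals (normal critical value).
ci_low = beta - 1.96 * se
ci_high = beta + 1.96 * se

# Fit: share of the in-sample variation in y captured by the line.
total_ss = (y - y.mean()) @ (y - y.mean())
r2 = 1 - (resid @ resid) / total_ss

for name, b, s, lo, hi in zip(["const", "x"], beta, se, ci_low, ci_high):
    print(f"{name}: coef={b:.3f}  se={s:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
print(f"R-squared: {r2:.3f}")
```

With a single predictor, $R^2$ equals the squared correlation between $X$ and $Y$, so it inherits the same limitation: it measures in-sample linear fit, not whether the relationship is causal or useful.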

Regression and causality (a careful note)

Regression describes a relationship. Causal interpretation requires stronger assumptions/design.

Common threats:

Omitted variables
Reverse causality
Measurement error
Selection bias

Mini-lab (Google Colab)

In-class checkpoints

Load divorce_raw.csv and filter to year = 2019.
Create:
- income_pc = gdp_pc / 1000
- d_rate = divorce_rate / marriage_rate * 1000 (the divorce–marriage ratio scaled by 1,000)
Make a scatter plot: d_rate vs income_pc.
Compute correlation between d_rate and income_pc.
Fit simple regression: d_rate ~ income_pc and interpret $\beta_1$.
Fit multiple regression: d_rate ~ income_pc + gdp_gr and interpret how $\beta_1$ changes (if it does).
Make one diagnostic plot (residuals vs fitted OR histogram of residuals).
Write a short “manager memo” (5–7 lines): headline + evidence + caveat.
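The checkpoints above could be sketched roughly as follows. This is not the official lab solution: the data frame is simulated so the code runs without the course file (in class it would come from `pd.read_csv`), the `year` column name is an assumption based on the filter step, `fit_ols` is a hypothetical helper built on NumPy, and plotting is left out.

```python
import numpy as np
import pandas as pd

# Stand-in for the course CSV: simulated rows, illustrative only.
rng = np.random.default_rng(42)
n = 60
raw = pd.DataFrame({
    "year": np.repeat([2018, 2019], n // 2),
    "gdp_pc": rng.uniform(1_000, 60_000, size=n),
    "gdp_gr": rng.normal(2.0, 1.5, size=n),
    "divorce_rate": rng.uniform(0.5, 4.0, size=n),
    "marriage_rate": rng.uniform(3.0, 9.0, size=n),
})

# Filter to 2019 and create the two variables from the checkpoint list.
df = raw[raw["year"] == 2019].copy()
df["income_pc"] = df["gdp_pc"] / 1000
df["d_rate"] = df["divorce_rate"] / df["marriage_rate"] * 1000

# Correlation between outcome and main predictor.
r = df["d_rate"].corr(df["income_pc"])


def fit_ols(data, outcome, predictors):
    """Least-squares fit; returns (coefficients, fitted values, residuals)."""
    X = np.column_stack(
        [np.ones(len(data))] + [data[p].to_numpy() for p in predictors]
    )
    y = data[outcome].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    return pd.Series(coef, index=["const", *predictors]), fitted, y - fitted


# Simple, then multiple regression; compare the income_pc coefficient.
b_simple, _, _ = fit_ols(df, "d_rate", ["income_pc"])
b_multi, fitted, resid = fit_ols(df, "d_rate", ["income_pc", "gdp_gr"])

print("correlation:", round(r, 3))
print("simple:  ", b_simple.round(3).to_dict())
print("multiple:", b_multi.round(3).to_dict())
```

The `fitted` and `resid` arrays are what the diagnostic step needs: plotting `resid` against `fitted` (or a histogram of `resid`) is the picture to look at before writing the memo. Since the data here are random, the printed coefficients carry no meaning; only the workflow does.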

Submission (after class)

Share the Colab link (view permission) or export to PDF.
Include your memo + one caveat as Markdown.