Lesson 6 — Correlation, Simple & Multiple Regression (interpretation for business/economics)
Why this matters¶
Regression is one of the most widely used tools in business and economics because it helps quantify relationships:
Is higher national income associated with higher (or lower) divorce rates?
Are higher prices associated with lower demand?
Which factors are correlated with churn or customer satisfaction?
But regression is easy to misuse—especially when we confuse association with causation.
Where regression sits in the AI/ML/DS map¶
The regression mindset: from question to model¶
Today’s running example (new dataset)¶
We use:
divorce-raw-1992.csv: divorce data
Unit: country (2019 cross-section)
Outcome ($Y$): divorce–marriage ratio, d_rate = divorce_rate / marriage_rate
Main predictor ($X$): income (rescaled GDP per capita), income_pc = gdp_pc / 1000
Control candidate: GDP growth (gdp_gr)
Step 0: Start with a picture (scatter plot)¶
Before any equation, draw the relationship.
If the scatter is basically a cloud: the relationship may be weak or non-linear.
If there is a visible upward/downward pattern: regression helps summarize it.
Step 1: Correlation (a warm-up)¶
Correlation is a standardized measure of linear association.
It ranges from -1 to +1.
It is symmetric: $\mathrm{corr}(X, Y) = \mathrm{corr}(Y, X)$.
It does not tell you “how much $Y$ changes when $X$ changes” in units.
Step 2: Simple regression (one predictor)¶
A simple regression estimates a line:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
$\beta_1$ tells you how $Y$ changes (on average) with a one-unit change in $X$.
$\varepsilon$ captures everything else not in the model.
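For reference, if the line $Y = \beta_0 + \beta_1 X + \varepsilon$ is fit by ordinary least squares (an assumption here; the lesson does not name the estimator), the estimates have a closed form that also shows how the slope relates to the correlation from Step 1. The symbols $r_{XY}$ (sample correlation), $s_X$, $s_Y$ (sample standard deviations), and $\bar{X}$, $\bar{Y}$ (sample means) are notation introduced for this note:

$$
\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = r_{XY}\,\frac{s_Y}{s_X},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
$$

The slope is the correlation rescaled into the units of $Y$ per unit of $X$, which is why the slope changes when units change while the correlation does not.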
Our example (simple regression)¶
$Y$: d_rate
$X$: income_pc
Interpretation template:
“A one-unit increase in income_pc (i.e., 1,000 dollars of GDP per capita) is associated with a $\beta_1$ change in d_rate, on average, in this sample.”
Step 3: Multiple regression (adding a control)¶
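Multiple regression adds additional predictors:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$
Now $\beta_1$ is interpreted as the association between $X_1$ and $Y$, holding other included variables constant.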
Why add controls?¶
To reduce confounding (informally):
If both income level and macroeconomic conditions are related to divorce/marriage patterns, a simple regression may mix these relationships.
Today we add:
gdp_gr (GDP growth) as a simple control.
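The effect of adding a control can be illustrated with a small simulation. The sketch below generates synthetic data in which gdp_gr is related to both income_pc and d_rate, then compares the income_pc coefficient with and without the control. All coefficients in the data-generating step are arbitrary assumptions chosen for illustration, and the helper `ols` is a hypothetical convenience function, not part of the lesson's materials.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (arbitrary numbers): gdp_gr is related to BOTH variables.
gdp_gr = rng.normal(2.0, 1.5, n)
income_pc = 20.0 - 3.0 * gdp_gr + rng.normal(0.0, 4.0, n)
# The "true" income_pc coefficient in this simulation is 0.005.
d_rate = 0.30 + 0.005 * income_pc - 0.02 * gdp_gr + rng.normal(0.0, 0.03, n)


def ols(y, *predictors):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


simple = ols(d_rate, income_pc)            # d_rate ~ income_pc
multiple = ols(d_rate, income_pc, gdp_gr)  # d_rate ~ income_pc + gdp_gr

print(f"simple:   b1 = {simple[1]:.4f}")
print(f"multiple: b1 = {multiple[1]:.4f}, b2 = {multiple[2]:.4f}")
```

In this setup the simple regression overstates the income_pc coefficient because it absorbs part of the gdp_gr relationship; including the control moves the estimate back toward 0.005. With real data the true value is unknown, so a change in the coefficient is a signal to investigate, not proof of which estimate is right.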
How to read regression output (what we focus on)¶
Coefficients
sign (positive/negative association)
magnitude (how much change in $Y$ per unit of $X$)
units matter
Uncertainty (standard errors / confidence intervals)
How precise is the estimate?
Fit (R-squared)
How much of the variation in $Y$ is explained (in-sample)?
Do not treat $R^2$ as “truth” or “usefulness” by itself.
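The three items above (coefficient, uncertainty, fit) can be computed by hand to see where the numbers in a regression table come from. This is a sketch on synthetic data, assuming classical (homoskedastic) standard errors and a normal-approximation 95% interval with the critical value 1.96; regression software may use a t-distribution or robust errors and report slightly different intervals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic data with a known slope of 0.005 (arbitrary, for illustration).
x = rng.uniform(1.0, 60.0, n)
y = 0.25 + 0.005 * x + rng.normal(0.0, 0.05, n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [intercept, slope]

# Uncertainty: classical standard errors from the residual variance.
resid = y - X @ coef
dof = n - X.shape[1]                          # observations minus parameters
sigma2 = resid @ resid / dof                  # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance matrix
se = np.sqrt(np.diag(cov))

# Approximate 95% confidence interval (normal critical value 1.96).
ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se

# Fit: share of in-sample variation in y explained by the line.
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(f"slope          = {coef[1]:.4f}  (SE {se[1]:.4f})")
print(f"approx. 95% CI = [{ci_low[1]:.4f}, {ci_high[1]:.4f}]")
print(f"R-squared      = {r2:.3f}")
```

A narrow interval says the slope is estimated precisely in this sample; it says nothing about whether the relationship is causal. Likewise, a high or low R-squared describes in-sample fit only.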
Regression and causality (a careful note)¶
Regression describes a relationship. Causal interpretation requires stronger assumptions/design.
Common threats:
Omitted variables
Reverse causality
Measurement error
Selection bias
Mini-lab (Google Colab)¶
In-class checkpoints¶
Load divorce_raw.csv and filter to year = 2019.
Create: income_pc = gdp_pc / 1000 and d_rate = divorce_rate / marriage_rate * 1000.
Make a scatter plot: d_rate vs income_pc.
Compute correlation between d_rate and income_pc.
Fit simple regression: d_rate ~ income_pc and interpret $\beta_1$.
Fit multiple regression: d_rate ~ income_pc + gdp_gr and interpret how $\beta_1$ changes (if it does).
Make one diagnostic plot (residuals vs fitted OR histogram of residuals).
Write a short “manager memo” (5–7 lines): headline + evidence + caveat.
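One possible way to organize the checkpoints in Colab is sketched below with pandas and the statsmodels formula API. Several things are assumptions rather than facts from the lesson: that the year column is named `year`, that the file sits in the working directory, and that rows with missing values are dropped. The function names `prepare`, `fit_models`, and `plot_checks` are hypothetical. The sketch follows the checkpoint definition of d_rate (with the factor of 1000); dropping that factor would only rescale the coefficients.

```python
import pandas as pd
import statsmodels.formula.api as smf


def prepare(df: pd.DataFrame, year: int = 2019) -> pd.DataFrame:
    """Filter to one year and build the two derived variables."""
    out = df.loc[df["year"] == year].copy()   # "year" column name is assumed
    out["income_pc"] = out["gdp_pc"] / 1000
    out["d_rate"] = out["divorce_rate"] / out["marriage_rate"] * 1000
    # Dropping incomplete rows is a choice made for this sketch.
    return out.dropna(subset=["d_rate", "income_pc", "gdp_gr"])


def fit_models(data: pd.DataFrame):
    """Fit the simple and multiple regressions from the checkpoints."""
    simple = smf.ols("d_rate ~ income_pc", data=data).fit()
    multiple = smf.ols("d_rate ~ income_pc + gdp_gr", data=data).fit()
    return simple, multiple


def plot_checks(data: pd.DataFrame, model) -> None:
    """Scatter plot plus a residuals-vs-fitted diagnostic."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(data["income_pc"], data["d_rate"])
    ax1.set_xlabel("income_pc")
    ax1.set_ylabel("d_rate")
    ax2.scatter(model.fittedvalues, model.resid)
    ax2.axhline(0, linestyle="--")
    ax2.set_xlabel("fitted values")
    ax2.set_ylabel("residuals")
    plt.show()


# Typical use in Colab (file location is an assumption):
# raw = pd.read_csv("divorce_raw.csv")
# data = prepare(raw)
# print(data["d_rate"].corr(data["income_pc"]))       # correlation
# simple, multiple = fit_models(data)
# print(simple.params["income_pc"], multiple.params["income_pc"])
# print(multiple.summary())                           # full regression table
# plot_checks(data, simple)
```

Comparing `simple.params["income_pc"]` with `multiple.params["income_pc"]` gives the "does the coefficient change?" evidence for the memo, and the residual plot supplies one natural caveat if it shows a clear pattern.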
Submission (after class)¶
Share the Colab link (view permission) or export to PDF.
Include your memo + one caveat as Markdown.