Lesson 5 — Data Cleaning, Preprocessing, and Merging Datasets
(from raw to analysis-ready)
Why this matters¶
Most of the time in real analytics is spent before modeling:
fixing inconsistent formats,
resolving category labels,
handling missing values,
ensuring the unit of observation is correct,
and merging multiple data sources safely.
Good cleaning is not “cosmetic.” It changes your results.
Today’s lab datasets¶
We will use two CSV files:
Both files are essentially country-level data, but most variables are stored in wide format (years appear in column names such as _2023, _2022, etc.).
To merge them properly, we will convert each file into a country–year panel.
Intended unit of observation (after cleaning)¶
One row = one country-year
Key columns:
flagCode+year(andcountryas a label)
The course-standard cleaning pipeline¶
Type check: numbers as numbers, dates as dates, categories as categories
Range check: impossible values (negative rates, nonsensical GDP values, etc.)
Uniqueness check: duplicates in keys (
flagCode,year)Missingness check: overall + by key variables
Consistency check: country codes/names match across files
Document everything: a short cleaning log
Wide vs long (why reshaping matters)¶
Many public datasets are reported in wide form:
DivorcesPer1kPop_2023,DivorcesPer1kPop_2022, ...
But analysis (and merging) is usually easier in long form:
flagCode,country,year,DivorcesPer1kPop, ...
Merging datasets (the professional habit)¶
In real projects, your analysis-ready dataset is often the result of a merge:
outcomes + covariates,
macro indicators + social indicators,
business KPIs + external context.
Join types (keep it simple)¶
Left join (recommended default): keep all rows from the main dataset and attach what matches from the other dataset.
Inner join: keep only matched rows (risk: silent data loss).
Outer join: keep all rows from both datasets (best for diagnosing mismatch).
Merge validation (required)¶
After merging, always check:
Row count before vs after
Unmatched records (new missing values created)
Duplicates created unexpectedly
Key uniqueness still holds
Mini-lab (Google Colab)¶
In-class checkpoints¶
A. Load and inspect
Load both CSV files into DataFrames.
Run
df.info()and identify at least two issues (types, missingness, column naming, etc.).
B. Reshape divorce data to country–year
3. Convert divorce-rates-by-country-2026.csv from wide to long so you have:
flagCode,country,year, and the variables:DivorcesPer1kPopNumberOfDivorcesMarriagesPer1kPopNumberOfMarriages
C. Reshape GDP per capita data to country–year
4. Convert gdp-per-capita-by-country-2026.csv from wide to long so you have:
flagCode,country,year, and at least:GDPPerCapitaViaWB(Optional: also reshape PPP variables if you want.)
D. Merge
5. Merge the two long datasets on flagCode + year (left join recommended).
6. Validate the merge:
row counts before/after,
unmatched rows introduced,
duplicates created,
key uniqueness.
E. Feature engineering 7. Create at least two features, for example:
log_gdp_pc = log(GDPPerCapitaViaWB)complete = 1if bothDivorcesPer1kPopandGDPPerCapitaViaWBare non-missing, else0
Produce a summary table:
number of observations,
number complete,
missingness of key variables.
F. Cleaning Log 9. Write a short cleaning log (template at the end) in your notebook.
Python patterns (copy/paste)¶
0) Setup and loading¶
import pandas as pd
import numpy as np
import re
div = pd.read_csv("divorce-rates-by-country-2026.csv")
gdp = pd.read_csv("gdp-per-capita-by-country-2026.csv")Quick checks you should always run¶
div.shape, gdp.shape
div.isna().mean().sort_values(ascending=False).head(10)
gdp.isna().mean().sort_values(ascending=False).head(10)1) Reshape the divorce dataset (wide → long panel)¶
The divorce file contains repeated year-suffixed columns for several variable stubs:
DivorcesPer1kPop_YYYYNumberOfDivorces_YYYYMarriagesPer1kPop_YYYYNumberOfMarriages_YYYY
div_year_cols = [c for c in div.columns if re.search(r"_\d{4}$", c)]
div_long = div.melt(
id_vars=["flagCode", "country"],
value_vars=div_year_cols,
var_name="var_year",
value_name="value"
)
div_long["year"] = div_long["var_year"].str.extract(r"(\d{4})").astype(int)
div_long["variable"] = div_long["var_year"].str.replace(r"_\d{4}$", "", regex=True)Pivot to one row per country-year:
div_panel = (div_long.pivot_table(
index=["flagCode", "country", "year"],
columns="variable",
values="value",
aggfunc="first"
).reset_index())Key validation:
div_panel.duplicated(subset=["flagCode", "year"]).sum()2) Reshape the GDP dataset (wide → long panel)¶
The GDP file has columns like:
GDPPerCapitaViaWB_YYYYGDPPerCapitaPPPIntViaWB_YYYY(optional)
Start with GDP per capita (WB):
gdp_year_cols = [c for c in gdp.columns if c.startswith("GDPPerCapitaViaWB_")]
gdp_long = gdp.melt(
id_vars=["flagCode", "country"],
value_vars=gdp_year_cols,
var_name="var_year",
value_name="GDPPerCapitaViaWB"
)
gdp_long["year"] = gdp_long["var_year"].str.extract(r"(\d{4})").astype(int)
gdp_panel = gdp_long[["flagCode", "country", "year", "GDPPerCapitaViaWB"]].copy()Key validation:
gdp_panel.duplicated(subset=["flagCode", "year"]).sum()3) Merge + validation¶
merged = div_panel.merge(
gdp_panel,
on=["flagCode", "year"],
how="left",
validate="one_to_one",
suffixes=("", "_gdp")
)
# Keep a clean country name (prefer divorce country if present)
if "country_gdp" in merged.columns:
merged["country"] = merged["country"].fillna(merged["country_gdp"])
merged = merged.drop(columns=["country_gdp"])Merge checks:
print("div_panel:", div_panel.shape, "gdp_panel:", gdp_panel.shape, "merged:", merged.shape)
print("\nTop missingness after merge:")
print(merged.isna().mean().sort_values(ascending=False).head(12))
print("\nDuplicates on key flagCode-year:", merged.duplicated(subset=["flagCode", "year"]).sum())4) Feature engineering¶
merged["log_gdp_pc"] = np.log(merged["GDPPerCapitaViaWB"])
merged["complete"] = (
merged["GDPPerCapitaViaWB"].notna() &
merged["DivorcesPer1kPop"].notna()
).astype(int)Quick summary table:
summary = pd.DataFrame({
"n_rows": [len(merged)],
"n_complete": [merged["complete"].sum()],
"share_complete": [merged["complete"].mean()]
})
summaryCleaning Log (required)¶
Copy this into your notebook and fill it in:
Files used:
Unit of observation: (before / after reshape)
Key columns used for merge:
Cleaning steps performed: (bullets)
Reshaping performed? (wide→long; how?)
Join type used + why:
Merge validation results: (row counts; unmatched; duplicates)
Features created:
One limitation / risk: (e.g., missingness, comparability, measurement)
AI check (responsible use)¶
Review questions (quiz / reflection)¶
Why can an inner join silently change your conclusions?
What checks would reveal a one-to-many “merge explosion”?
Why is long form often easier for analysis than wide form?
What is one way cleaning choices can introduce bias?