Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Lesson 5 — Data Cleaning, Preprocessing, and Merging Datasets

(from raw to analysis-ready)

Why this matters

Most of the time in real analytics is spent before modeling:

Good cleaning is not “cosmetic.” It changes your results.


Today’s lab datasets

We will use two CSV files:

Both files are essentially country-level data, but most variables are stored in wide format (years appear in column names such as _2023, _2022, etc.).
To merge them properly, we will convert each file into a country–year panel.

Intended unit of observation (after cleaning)


The course-standard cleaning pipeline

  1. Type check: numbers as numbers, dates as dates, categories as categories

  2. Range check: impossible values (negative rates, nonsensical GDP values, etc.)

  3. Uniqueness check: duplicates in keys (flagCode, year)

  4. Missingness check: overall + by key variables

  5. Consistency check: country codes/names match across files

  6. Document everything: a short cleaning log


Wide vs long (why reshaping matters)

Many public datasets are reported in wide form:

But analysis (and merging) is usually easier in long form:


Merging datasets (the professional habit)

In real projects, your analysis-ready dataset is often the result of a merge:

Join types (keep it simple)

Merge validation (required)

After merging, always check:

  1. Row count before vs after

  2. Unmatched records (new missing values created)

  3. Duplicates created unexpectedly

  4. Key uniqueness still holds


Mini-lab (Google Colab)

In-class checkpoints

A. Load and inspect

  1. Load both CSV files into DataFrames.

  2. Run df.info() and identify at least two issues (types, missingness, column naming, etc.).

B. Reshape divorce data to country–year 3. Convert divorce-rates-by-country-2026.csv from wide to long so you have:

C. Reshape GDP per capita data to country–year 4. Convert gdp-per-capita-by-country-2026.csv from wide to long so you have:

D. Merge 5. Merge the two long datasets on flagCode + year (left join recommended). 6. Validate the merge:

E. Feature engineering 7. Create at least two features, for example:

  1. Produce a summary table:

    • number of observations,

    • number complete,

    • missingness of key variables.

F. Cleaning Log 9. Write a short cleaning log (template at the end) in your notebook.


Python patterns (copy/paste)

0) Setup and loading

import pandas as pd
import numpy as np
import re

div = pd.read_csv("divorce-rates-by-country-2026.csv")
gdp = pd.read_csv("gdp-per-capita-by-country-2026.csv")

Quick checks you should always run

div.shape, gdp.shape
div.isna().mean().sort_values(ascending=False).head(10)
gdp.isna().mean().sort_values(ascending=False).head(10)

1) Reshape the divorce dataset (wide → long panel)

The divorce file contains repeated year-suffixed columns for several variable stubs:

div_year_cols = [c for c in div.columns if re.search(r"_\d{4}$", c)]

div_long = div.melt(
    id_vars=["flagCode", "country"],
    value_vars=div_year_cols,
    var_name="var_year",
    value_name="value"
)

div_long["year"] = div_long["var_year"].str.extract(r"(\d{4})").astype(int)
div_long["variable"] = div_long["var_year"].str.replace(r"_\d{4}$", "", regex=True)

Pivot to one row per country-year:

div_panel = (div_long.pivot_table(
    index=["flagCode", "country", "year"],
    columns="variable",
    values="value",
    aggfunc="first"
).reset_index())

Key validation:

div_panel.duplicated(subset=["flagCode", "year"]).sum()

2) Reshape the GDP dataset (wide → long panel)

The GDP file has columns like:

Start with GDP per capita (WB):

gdp_year_cols = [c for c in gdp.columns if c.startswith("GDPPerCapitaViaWB_")]

gdp_long = gdp.melt(
    id_vars=["flagCode", "country"],
    value_vars=gdp_year_cols,
    var_name="var_year",
    value_name="GDPPerCapitaViaWB"
)

gdp_long["year"] = gdp_long["var_year"].str.extract(r"(\d{4})").astype(int)
gdp_panel = gdp_long[["flagCode", "country", "year", "GDPPerCapitaViaWB"]].copy()

Key validation:

gdp_panel.duplicated(subset=["flagCode", "year"]).sum()

3) Merge + validation

merged = div_panel.merge(
    gdp_panel,
    on=["flagCode", "year"],
    how="left",
    validate="one_to_one",
    suffixes=("", "_gdp")
)

# Keep a clean country name (prefer divorce country if present)
if "country_gdp" in merged.columns:
    merged["country"] = merged["country"].fillna(merged["country_gdp"])
    merged = merged.drop(columns=["country_gdp"])

Merge checks:

print("div_panel:", div_panel.shape, "gdp_panel:", gdp_panel.shape, "merged:", merged.shape)

print("\nTop missingness after merge:")
print(merged.isna().mean().sort_values(ascending=False).head(12))

print("\nDuplicates on key flagCode-year:", merged.duplicated(subset=["flagCode", "year"]).sum())

4) Feature engineering

merged["log_gdp_pc"] = np.log(merged["GDPPerCapitaViaWB"])

merged["complete"] = (
    merged["GDPPerCapitaViaWB"].notna() &
    merged["DivorcesPer1kPop"].notna()
).astype(int)

Quick summary table:

summary = pd.DataFrame({
    "n_rows": [len(merged)],
    "n_complete": [merged["complete"].sum()],
    "share_complete": [merged["complete"].mean()]
})
summary

Cleaning Log (required)

Copy this into your notebook and fill it in:


AI check (responsible use)


Review questions (quiz / reflection)

  1. Why can an inner join silently change your conclusions?

  2. What checks would reveal a one-to-many “merge explosion”?

  3. Why is long form often easier for analysis than wide form?

  4. What is one way cleaning choices can introduce bias?