Lesson 5 — Data Cleaning & Preprocessing (from raw to analysis-ready)
Why this matters (motivation)¶
Most of the time in real analytics is spent before modeling:
fixing inconsistent input formats,
resolving category labels,
handling missing values,
and ensuring comparisons are meaningful.
Good cleaning is not “cosmetic.” It changes your results.
The cleaning mindset: “What would break trust?”¶
Before fixing anything, ask which issues could change a conclusion or mislead a stakeholder; clean those first, and record what you did.
A simple cleaning pipeline (the course standard)¶
Type check: numbers as numbers, dates as dates, categories as categories
Range check: impossible values (negative sales, invalid ages, etc.)
Uniqueness check: duplicates in IDs or key fields
Missingness check: overall and by group (region/segment/time)
Consistency check: category labels, units, and definitions
Document everything: a short cleaning log
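The checks above can be sketched in pandas. This is a minimal illustration; the column names (`customer_id`, `date`, `sales`) are placeholders, not a required schema:

```python
import pandas as pd

# Toy data with deliberate problems (illustrative only)
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "date": ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07"],
    "sales": [120.0, -30.0, 45.0, None],
})

# Type check: dates should be datetimes, not strings
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Range check: impossible values (negative sales)
bad_range = df[df["sales"] < 0]

# Uniqueness check: how many customer IDs repeat?
dup_ids = df["customer_id"].duplicated().sum()

# Missingness check: share of missing values per column
missing_share = df.isna().mean()

print(bad_range)
print(dup_ids, missing_share["sales"])
```

Each check produces something you can paste into the cleaning log: the offending rows, a duplicate count, or a missingness share.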
Common cleaning problems (and practical fixes)¶
1) Data types and parsing¶
Examples:
date stored as a string,
currency stored as "1,200" (string with commas),
numeric column includes "N/A" or "—".
2) Missing values (what to do depends on why)¶
Missingness can be:
rare noise (logging errors),
structural (not applicable),
systematic (missing more for a particular group).
Practical options (in increasing strength):
leave missing values and be explicit about sample size,
drop rows only if missingness is small and plausibly random,
impute using simple rules (median, last observation, etc.) with justification,
create a missingness indicator (a flag) when missingness itself is informative.
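These options can be sketched in pandas; the `region`/`income` columns and the median rule are assumptions for illustration, and any imputation choice belongs in your cleaning log:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "income": [50_000, None, 42_000, None],
})

# Overall missingness per column
overall = df.isna().mean()

# Missingness by group: does one region miss more than another?
by_region = df.groupby("region")["income"].apply(lambda s: s.isna().mean())

# A missingness indicator: keep the fact that income was missing
df["missing_income_flag"] = df["income"].isna().astype(int)

# Simple median imputation (justify this in the cleaning log)
df["income"] = df["income"].fillna(df["income"].median())

print(overall["income"], by_region, df["missing_income_flag"].sum())
```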
3) Duplicates and unit of observation¶
Always confirm the unit:
customer-level?
transaction-level?
day-level?
Duplicates may be:
exact duplicates (same row repeated),
key duplicates (same customer_id + date repeated).
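The distinction between exact and key duplicates is easy to check in pandas (toy data; the key `customer_id + date` is just one possible unit of observation):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "date": ["2024-01-05", "2024-01-05", "2024-01-05", "2024-01-05"],
    "sales": [100, 100, 50, 60],
})

# Exact duplicates: every column identical
exact_dups = df.duplicated().sum()  # 1

# Key duplicates: same customer_id + date, other columns may differ
key_dups = df.duplicated(subset=["customer_id", "date"]).sum()  # 2

# Drop only after you understand why the duplicates exist
deduped = df.drop_duplicates(subset=["customer_id", "date"], keep="first")
```

Note that `key_dups > exact_dups` here: customer 2 has two different sales amounts on the same date, which may be two real transactions, not a duplicate at all.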
4) Outliers (don’t delete automatically)¶
Outliers might be:
true extreme behavior (high-value customers),
measurement error,
unit problems (e.g., 1,000 vs 1.000).
A good approach:
flag outliers,
inspect a small number of extreme cases,
justify whether you keep, cap (winsorize), or exclude.
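A minimal sketch of flag-then-decide, using the common 1.5×IQR rule (a convention, not a law; the cutoff and the 95th-percentile cap are illustrative choices):

```python
import pandas as pd

sales = pd.Series([100, 120, 95, 110, 5000])  # one suspicious extreme

# Flag with a simple IQR rule
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)

# Inspect the flagged cases before deciding anything
print(sales[is_outlier])

# Winsorize (cap) instead of deleting, if you can justify it
capped = sales.clip(upper=sales.quantile(0.95))
```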
Feature engineering (simple and meaningful)¶
Examples:
growth rate: sales_growth = (sales_t - sales_{t-1}) / sales_{t-1},
log transform for skewed variables: log_sales = log(1 + sales),
flags: high_value_customer, promotion_period, missing_income_flag,
time features: month/quarter, day-of-week (if relevant).
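The features above map directly to pandas one-liners (a sketch; the "above median" threshold for the high-value flag is an assumed rule, not a standard):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31"]),
    "sales": [100.0, 120.0, 90.0],
})

# Growth rate relative to the previous period
df["sales_growth"] = df["sales"].pct_change()

# Log transform for a right-skewed variable (log1p = log(1 + x), safe at zero)
df["log_sales"] = np.log1p(df["sales"])

# A simple flag (threshold is an illustrative choice) and time features
df["high_value"] = (df["sales"] > df["sales"].median()).astype(int)
df["month"] = df["date"].dt.month
df["quarter"] = df["date"].dt.quarter
```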
Mini case: cleaning a “messy sales” dataset¶
Scenario: You have sales records with:
inconsistent category labels (“BKK”, “Bangkok”, “Bangkok ”),
dates stored as strings in mixed formats,
missing marketing spend for some regions,
extreme sales values.
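For the inconsistent labels in this scenario, one defensive pattern is trim → lowercase → map known aliases, so anything unexpected surfaces as NaN instead of slipping through (the alias table is a sketch you would build from the data):

```python
import pandas as pd

region = pd.Series(["BKK", "Bangkok", "Bangkok ", "bangkok"])

# Standardize: trim whitespace, normalize case, then map known aliases
mapping = {"bkk": "Bangkok", "bangkok": "Bangkok"}
clean = region.str.strip().str.lower().map(mapping)

# Unmapped labels become NaN, so surprises are visible, not silent
print(clean.value_counts(dropna=False))
```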
Goal: Produce an analysis-ready dataset for regression and visualization later.
Deliverables from cleaning:
A cleaned DataFrame with correct types
A missingness summary (overall + by region/category)
A small outlier check
2–3 engineered features (e.g., log_sales, growth, flags)
A short Cleaning Log
Mini-lab (Google Colab)¶
In-class checkpoints
Run df.info() and identify at least two columns with incorrect types.
Produce a missingness table (overall and by one grouping variable).
Check duplicates using a defined key (e.g., customer_id + date or store_id + date).
Identify possible outliers and decide a rule (keep / flag / cap) with a brief justification.
Create at least two engineered features and show their distributions.
Submission (after class)
Share the Colab link (view permission) or export to PDF.
Include your Cleaning Log as a Markdown cell.
AI check (responsible use for cleaning)¶
Good prompt examples
“Given these columns and common issues (mixed date formats, category typos), propose a cleaning checklist.”
“Show pandas code to standardize category labels and parse dates safely.”
“How can I summarize missingness by region and year?”
Bad prompt example
“Clean my dataset and tell me the results” (without showing steps and checks)
Review questions (quiz / reflection)¶
Why can dropping missing values change your conclusions?
Give one case where an outlier might be real and important (not an error).
What is the unit of observation in your dataset, and how does that affect duplicate checks?