
Lesson 5 — Data Cleaning & Preprocessing (from raw to analysis-ready)

Why this matters (motivation)

Most of the time in real analytics is spent before modeling: finding, cleaning, and validating the data.

Good cleaning is not “cosmetic.” It changes your results.


The cleaning mindset: “What would break trust?”

A simple cleaning pipeline (the course standard)

  1. Type check: numbers as numbers, dates as dates, categories as categories

  2. Range check: impossible values (negative sales, invalid ages, etc.)

  3. Uniqueness check: duplicates in IDs or key fields

  4. Missingness check: overall and by group (region/segment/time)

  5. Consistency check: category labels, units, and definitions

  6. Document everything: a short cleaning log
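The six checks above can be sketched in a few lines of pandas. This is a minimal illustration on toy data; the column names (`order_id`, `date`, `sales`, `region`) are placeholders for your own schema.

```python
import pandas as pd

# Toy data standing in for a raw extract (all column names are examples).
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "not a date"],
    "sales": ["100", "250", "250", "-30"],
    "region": ["North", "north", "North", "South"],
})

# 1. Type check: coerce strings to proper dtypes; failures become NaN/NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")

# 2. Range check: negative sales are impossible here.
bad_range = df[df["sales"] < 0]

# 3. Uniqueness check: order_id should identify exactly one row.
dupes = df[df.duplicated(subset="order_id", keep=False)]

# 4. Missingness check: share of missing values per column.
missing = df.isna().mean()

# 5. Consistency check: normalize labels before counting categories.
labels = df["region"].str.strip().str.title().value_counts()

print(len(bad_range), len(dupes))  # 1 impossible value, 2 duplicate rows
```

Step 6 (the cleaning log) is plain documentation: note what each check found and what you did about it.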


Common cleaning problems (and practical fixes)

1) Data types and parsing

Examples:
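A minimal sketch of common parsing fixes, assuming hypothetical columns `amount`, `order_date`, and `segment`:

```python
import pandas as pd

raw = pd.DataFrame({
    "amount": ["1,200", "350", "N/A"],               # numbers stored as strings
    "order_date": ["2024-03-01", "2024-03-02", ""],  # dates stored as strings
    "segment": ["A", "B", "A"],
})

# Strip thousands separators, then coerce; unparseable values become NaN.
raw["amount"] = pd.to_numeric(raw["amount"].str.replace(",", ""), errors="coerce")

# errors="coerce" turns unparseable dates into NaT instead of raising.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")

# Low-cardinality strings are cheaper and safer as categories.
raw["segment"] = raw["segment"].astype("category")

print(raw.dtypes)
```

The key habit: coerce rather than silently trust, then inspect how many values failed to parse.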

2) Missing values (what to do depends on why)

Missingness can be: completely random (MCAR), explained by other observed variables (MAR), or related to the missing value itself (MNAR). The right fix differs for each.

Practical options (in increasing strength): drop rows, fill with a global constant or median, fill with a group-level statistic, or model the missing values.
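A sketch of diagnosing missingness and two common fill strategies, on toy data with hypothetical columns `region` and `sales`:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales": [100.0, np.nan, 80.0, np.nan, 120.0],
})

# Diagnose first: overall and by group.
overall = df["sales"].isna().mean()
by_region = df.groupby("region")["sales"].apply(lambda s: s.isna().mean())

# Weakest option: drop rows -- risky if missingness differs by group.
dropped = df.dropna(subset=["sales"])

# Fill with a global median (ignores group differences).
global_fill = df["sales"].fillna(df["sales"].median())

# Stronger: fill with the group median, preserving group differences.
group_fill = df["sales"].fillna(df.groupby("region")["sales"].transform("median"))

print(by_region)
```

Whatever you choose, record it in the cleaning log: the choice is an analytical decision, not a technicality.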

3) Duplicates and unit of observation

Always confirm the unit: is one row a customer, a transaction, or a store-day?

Duplicates may be: true errors (the same record loaded twice) or legitimate repeats (the same customer buying twice on the same date).
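A minimal duplicate check against a defined key, using the hypothetical `customer_id` + `date` key from the checkpoints:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "sales": [50, 50, 70, 90],
})

# Define the key that should be unique for your unit of observation.
key = ["customer_id", "date"]

# Inspect all rows involved in a duplicate before deciding anything.
dupes = df[df.duplicated(subset=key, keep=False)]

# Exact copies are usually safe to drop; keep the first occurrence.
deduped = df.drop_duplicates(subset=key, keep="first")

print(len(dupes), len(deduped))  # 2 duplicate rows, 3 rows remain
```

If the duplicated rows differ in other columns, that is not a simple duplicate. It is a sign the key (or the unit of observation) is wrong.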

4) Outliers (don’t delete automatically)

Outliers might be: data-entry errors, unit or scale mistakes, or real extreme values that carry important information.

A good approach: investigate first, then keep, flag, or cap with a documented rule, rather than deleting automatically.
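One transparent screening rule is Tukey's IQR fence; the sketch below flags and caps rather than deletes (the data are illustrative):

```python
import pandas as pd

sales = pd.Series([100, 110, 95, 105, 98, 102, 1000])  # one suspicious value

# Tukey's IQR fences: a common, transparent screening rule.
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than delete: keep the information for later review.
is_outlier = (sales < lower) | (sales > upper)

# Capping (winsorizing) is an alternative when the value is clearly an error.
capped = sales.clip(lower=lower, upper=upper)

print(int(is_outlier.sum()))  # 1
```

Whichever rule you pick (keep / flag / cap), write the rule and its justification into the cleaning log.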


Feature engineering (simple and meaningful)

Examples: log transforms (log_sales), period-over-period growth rates, and simple binary flags.
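These three features can be built in a few lines; the sketch assumes hypothetical `store_id`, `date`, and `sales` columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01",
                            "2024-01-01", "2024-02-01"]),
    "sales": [100.0, 150.0, 0.0, 200.0, 220.0],
}).sort_values(["store_id", "date"])

# log1p handles zero sales; a plain log would produce -inf.
df["log_sales"] = np.log1p(df["sales"])

# Period-over-period growth within each store (first period is NaN).
df["growth"] = df.groupby("store_id")["sales"].pct_change()

# Simple flag features are often more useful than fancy transforms.
df["zero_sales_flag"] = (df["sales"] == 0).astype(int)

print(df[["log_sales", "growth", "zero_sales_flag"]].round(2))
```

Note the sort before `pct_change`: growth computed on unordered rows is silently wrong.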


Mini case: cleaning a “messy sales” dataset

Scenario: You have sales records with the usual problems: wrong types, missing values, duplicate rows, and a few suspicious outliers.

Goal: Produce an analysis-ready dataset for regression and visualization later.

Deliverables from cleaning:

  1. A cleaned DataFrame with correct types

  2. A missingness summary (overall + by region/category)

  3. A small outlier check

  4. 2–3 engineered features (e.g., log_sales, growth, flags)

  5. A short Cleaning Log
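A cleaning log can be as simple as a structured list kept alongside the code. The entries below are purely illustrative (the steps and row counts are made up for the example):

```python
# Illustrative cleaning log: each entry records what was done and why.
cleaning_log = [
    {"step": "types",      "action": "coerced 'sales' to numeric",
     "rows_affected": 12},
    {"step": "duplicates", "action": "dropped exact copies on customer_id + date",
     "rows_affected": 3},
    {"step": "outliers",   "action": "flagged values above the upper IQR fence",
     "rows_affected": 2},
]

for entry in cleaning_log:
    print(f"{entry['step']}: {entry['action']} ({entry['rows_affected']} rows)")
```

The point is reproducibility: anyone rerunning your notebook should be able to see every decision that changed the data.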


Mini-lab (Google Colab)

In-class checkpoints

  1. Run df.info() and identify at least two columns with incorrect types.

  2. Produce a missingness table (overall and by one grouping variable).

  3. Check duplicates using a defined key (e.g., customer_id + date or store_id + date).

  4. Identify possible outliers and decide a rule (keep / flag / cap) with a brief justification.

  5. Create at least two engineered features and show their distributions.

Submission (after class)


AI check (responsible use for cleaning)

Good prompt examples

Bad prompt example


Review questions (quiz / reflection)

  1. Why can dropping missing values change your conclusions?

  2. Give one case where an outlier might be real and important (not an error).

  3. What is the unit of observation in your dataset, and how does that affect duplicate checks?