Lesson 2 — Data Literacy & Exploratory Data Analysis (EDA)

Why this matters (motivation)

Most mistakes in analytics happen before modeling:

EDA is your “first conversation” with the data.


The EDA mindset: what questions are we asking?

A simple EDA checklist (what we do every time)

  1. Preview: rows, columns, data types

  2. Validate: missing values, duplicates, impossible values

  3. Summarize: distributions (center + spread)

  4. Compare: by group (category, time, segment)

  5. Visualize: choose the right chart for the question

  6. Write: 3–5 plain-language insights + 1 caveat
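
Steps 1 to 4 of the checklist can be sketched in pandas roughly as follows. This is a minimal illustration, not a prescribed workflow: the toy table, its column names (`customer_id`, `segment`, `age`, `monthly_spend`), and the rule that a negative age is impossible are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "segment": ["A", "B", "B", "A", "B", "A"],
    "age": [34, 41, 41, -5, 29, np.nan],
    "monthly_spend": [120.0, 80.0, 80.0, 95.0, np.nan, 300.0],
})

# 1. Preview: rows, columns, data types
print(df.shape)
print(df.dtypes)
print(df.head())

# 2. Validate: missing values, duplicates, impossible values
missing_per_column = df.isna().sum()          # count of missing cells per column
n_duplicates = int(df.duplicated().sum())     # fully identical rows after the first
impossible_age = df[df["age"] < 0]            # assumed rule: age cannot be negative

# 3. Summarize: center and spread of one numeric column
spend = df["monthly_spend"]
summary = {
    "mean": spend.mean(),
    "median": spend.median(),
    "std": spend.std(),
    "iqr": spend.quantile(0.75) - spend.quantile(0.25),
}

# 4. Compare: the same summary, split by group
by_segment = df.groupby("segment")["monthly_spend"].median()

print(missing_per_column, n_duplicates, summary, by_segment, sep="\n")
```

In this toy table the single large value (300) pulls the mean well above the median, which is the kind of observation that feeds steps 5 and 6.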


Data literacy essentials (types and pitfalls)

Data types that matter

Missing values: what they might mean

Missingness can be:


Descriptive statistics that students actually use

Core summary numbers

What to report (rule of thumb)


Visualization as a decision tool (not decoration)

Quick chart chooser (practical)


Mini case: “Sales and marketing” dataset (EDA story)

Question: “Which product categories grew, and is growth associated with marketing spend?”

EDA steps:

  1. Check data types (date as datetime, spend as numeric)

  2. Summarize missing values (overall + by category)

  3. Plot distribution of sales (is it skewed?)

  4. Plot sales over time (overall + by category)

  5. Scatter: sales vs marketing (colored by category)

  6. Write 3 insights + 1 caveat
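
The EDA steps above might look like this in pandas. The data here is invented and the column names (`date`, `category`, `sales`, `marketing_spend`) are assumptions, since the lesson does not specify the dataset's schema. The charts are wrapped in a helper, `draw_charts`, which is a hypothetical name and assumes matplotlib is installed (as it is in Google Colab).

```python
import pandas as pd

# Hypothetical data: column names and values are made up for illustration.
raw = pd.DataFrame({
    "date": ["2024-01-15", "2024-01-20", "2024-02-10",
             "2024-02-18", "2024-03-05", "2024-03-22"],
    "category": ["Toys", "Books", "Toys", "Books", "Toys", "Books"],
    "sales": ["100", "200", "150", "210", "225", None],
    "marketing_spend": ["10", "20", "15", "20", "25", "22"],
})

sales = raw.copy()

# Step 1: data types (date as datetime, sales and spend as numeric)
sales["date"] = pd.to_datetime(sales["date"])
sales["sales"] = pd.to_numeric(sales["sales"], errors="coerce")
sales["marketing_spend"] = pd.to_numeric(sales["marketing_spend"], errors="coerce")

# Step 2: missing values, overall and by category
missing_overall = sales.isna().sum()
missing_by_category = sales["sales"].isna().groupby(sales["category"]).sum()

# Step 3: a numeric hint about skew (the histogram is drawn in draw_charts)
sales_skew = sales["sales"].skew()

# Step 4: monthly sales by category; min_count=1 keeps an all-missing month
# as NaN instead of silently reporting it as 0
sales["month"] = sales["date"].dt.to_period("M")
monthly = (
    sales.groupby(["month", "category"])["sales"]
    .sum(min_count=1)
    .unstack("category")
)
growth = monthly.iloc[-1] / monthly.iloc[0] - 1  # last month vs first month

# Step 5: association between sales and marketing spend (not causation)
corr = sales["sales"].corr(sales["marketing_spend"])


def draw_charts(sales: pd.DataFrame, monthly: pd.DataFrame):
    """Draw the histogram, the time series, and the scatter for steps 3 to 5."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sales["sales"].plot.hist(ax=axes[0], title="Distribution of sales")
    monthly.plot(ax=axes[1], marker="o", title="Sales over time by category")
    for name, group in sales.groupby("category"):
        axes[2].scatter(group["marketing_spend"], group["sales"], label=name)
    axes[2].set_xlabel("marketing_spend")
    axes[2].set_ylabel("sales")
    axes[2].set_title("Sales vs marketing spend")
    axes[2].legend()
    fig.tight_layout()
    return fig


print(missing_by_category, monthly, growth, round(corr, 2), sep="\n")
```

In a notebook, `draw_charts(sales, monthly)` renders the three figures. A natural caveat for step 6 with data like this: the missing March value for one category means its growth cannot be computed, and a high correlation shows association only.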


Mini-lab (Google Colab)

Churn Data: https://drive.google.com/file/d/1CcOCXyGM9GLFmzFbKfmIhioM-yLDkDqU/view?usp=sharing

In-class tasks (checkpoints)

  1. Print data types and basic shape

  2. Create a missingness summary table

  3. Produce: (i) histogram/boxplot, (ii) line chart, (iii) bar chart by category

  4. Submit 3 insights + 1 caveat in markdown inside the notebook
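
For checkpoints 1 and 2, one possible shape for the missingness summary table is sketched below. The helper name `missingness_summary` and the small stand-in table are hypothetical; the real churn file's columns are not listed in this lesson, so swap in the DataFrame you load from the link above (for example with `pd.read_csv`, assuming the file is a CSV).

```python
import numpy as np
import pandas as pd


def missingness_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and missing percent."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing values appear first
    return summary.sort_values("n_missing", ascending=False)


# Stand-in for the churn data; these columns are assumptions for illustration.
churn = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "tenure_months": [12, np.nan, 3, 40],
    "plan": ["basic", "pro", None, None],
    "churned": [0, 1, 0, 1],
})

print(churn.shape)    # checkpoint 1: basic shape
print(churn.dtypes)   # checkpoint 1: data types
table = missingness_summary(churn)
print(table)          # checkpoint 2: missingness summary
```

Sorting by `n_missing` puts the most affected columns at the top, which makes it easier to decide where to look for systematic missingness first.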

Submission


AI check (responsible use for EDA)

Good prompt examples

Bad prompt example


Review questions (quiz / reflection)

  1. When would you report median and IQR instead of mean and SD?

  2. Give one example of systematic missingness in a business dataset.

  3. Which chart would you choose to show: (i) trend, (ii) distribution, (iii) group comparison—and why?

Please submit your responses to the reflection prompts for Lesson 1 and Lesson 2 before the next class: https://docs.google.com/forms/d/e/1FAIpQLSfsq2ln4ru9G1Mtt8Huezzl52RbNw0SFnXyeMj9g2wuudt8UQ/viewform?usp=publish-editor