Lesson 2 — Data Literacy & Exploratory Data Analysis (EDA) - Research with AI

1.0 Why this matters (motivation)

Most mistakes in analytics happen before modeling:

confusing data types (e.g., numbers stored as strings),
ignoring missingness,
trusting outliers without checking,
using the wrong chart and telling the wrong story.

EDA is your “first conversation” with the data.

2.0 The EDA mindset: what questions are we asking?

2.1 A simple EDA checklist (what we do every time)

Preview: rows, columns, data types
Validate: missing values, duplicates, impossible values
Summarize: distributions (center + spread)
Compare: by group (category, time, segment)
Visualize: choose the right chart for the question
Write: 3–5 plain-language insights + 1 caveat

This EDA checklist forms part of a reproducible analytical workflow that we will use repeatedly throughout the course.
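The Preview, Validate, and Summarize steps of the checklist can be sketched in pandas roughly as follows. The DataFrame and its column names (`order_id`, `region`, `revenue`) are invented for illustration, and the rule that revenue cannot be negative is an assumption about this toy data, not a general fact.

```python
import pandas as pd

# Hypothetical toy data: one duplicated row, one missing value, one impossible value
df = pd.DataFrame({
    "order_id": ["A1", "A2", "A2", "A3", "A4"],
    "region": ["North", "South", "South", "East", None],
    "revenue": [120.0, 85.5, 85.5, -40.0, 230.0],
})

# Preview: rows, columns, data types
print(df.shape)
print(df.dtypes)

# Validate: missing values, duplicates, impossible values
missing_per_column = df.isna().sum()
n_duplicate_rows = int(df.duplicated().sum())
n_negative_revenue = int((df["revenue"] < 0).sum())  # assumes revenue cannot be negative

# Summarize: center and spread of a numeric column
summary = df["revenue"].describe()

print(missing_per_column, n_duplicate_rows, n_negative_revenue, summary, sep="\n\n")
```

What counts as an "impossible value" depends on the dataset, so the validation rules are usually the part that needs the most thought.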

3.0 Data literacy essentials (types and pitfalls)

3.1 Data types that matter

Numeric: revenue, price, age
Categorical: region, product category, plan type
Datetime: purchase date, signup date
Identifiers: customer_id (not a “number” you average)
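A minimal sketch of putting each column into the type it should have. The data and column names are hypothetical, chosen to mirror the four types listed above; the string `"n/a"` stands in for whatever placeholder a real export might use.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV export
raw = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "price": ["19.99", "5", "n/a"],  # numbers stored as strings
    "plan_type": ["basic", "pro", "basic"],
    "signup_date": ["2024-01-15", "2024-02-01", "2024-02-20"],
})

clean = raw.copy()
# errors="coerce" turns unparseable entries such as "n/a" into NaN
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")
clean["plan_type"] = clean["plan_type"].astype("category")
clean["signup_date"] = pd.to_datetime(clean["signup_date"])
# Store identifiers as strings so they are never averaged by accident
clean["customer_id"] = clean["customer_id"].astype(str)

print(clean.dtypes)
```

Coercing bad entries to NaN makes them visible in the missing-value checks instead of silently keeping the whole column as text.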

3.2 Missing values: what they might mean

Missingness can be:

random (e.g., occasional logging failures), or
systematic (e.g., income missing more for high-income respondents), which can bias any summary computed from the remaining rows.
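One way to look for systematic missingness is to compare the share of missing values across groups. The sketch below uses made-up survey data with hypothetical `segment` and `income` columns; a large gap between groups is a hint worth investigating, not proof of a systematic cause.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing more often in one segment
survey = pd.DataFrame({
    "segment": ["low", "low", "low", "high", "high", "high"],
    "income": [28000, 31000, np.nan, np.nan, np.nan, 150000],
})

# Overall missingness per column (count and percentage)
missing_summary = pd.DataFrame({
    "n_missing": survey.isna().sum(),
    "pct_missing": survey.isna().mean() * 100,
})

# Share of missing income within each segment
missing_by_segment = survey["income"].isna().groupby(survey["segment"]).mean()

print(missing_summary)
print(missing_by_segment)
```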

4.0 Descriptive statistics that students actually use

4.1 Core summary numbers

Mean: arithmetic average (sensitive to outliers)
Median: middle value when the data are sorted (robust to outliers)
Standard deviation (SD): typical distance of values from the mean
Interquartile range (IQR): spread of the middle 50% (Q3 minus Q1)
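The four summary numbers can be computed in pandas as sketched below. The order values are invented, with one deliberately extreme value to show how differently the mean and the median react.

```python
import pandas as pd

# Hypothetical order values with one extreme value
orders = pd.Series([20, 22, 25, 27, 30, 500], name="order_value")

mean = orders.mean()
median = orders.median()
sd = orders.std()  # sample standard deviation (ddof=1), the pandas default
iqr = orders.quantile(0.75) - orders.quantile(0.25)

print(f"mean={mean:.1f}, median={median:.1f}, SD={sd:.1f}, IQR={iqr:.1f}")
```

In this toy example the single value of 500 pulls the mean up to 104, while the median stays at 26, which is much closer to what a typical order looks like.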

4.2 What to report (rule of thumb)

If the histogram looks symmetric: mean ± SD
If it looks skewed: median + IQR
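The rule of thumb is a visual judgment, but it can be approximated numerically. The helper below is a hypothetical sketch: `describe_for_report` is not a pandas function, and the skewness cutoff of 1.0 is an arbitrary assumption used as a stand-in for "looks skewed". Always check the histogram as well.

```python
import pandas as pd


def describe_for_report(values: pd.Series, skew_cutoff: float = 1.0) -> str:
    """Return a 'mean ± SD' or 'median (IQR)' string based on sample skewness.

    skew_cutoff is an illustrative assumption, not a standard threshold.
    """
    values = values.dropna()
    if abs(values.skew()) <= skew_cutoff:
        return f"mean {values.mean():.1f} ± SD {values.std():.1f}"
    q1, q3 = values.quantile([0.25, 0.75])
    return f"median {values.median():.1f} (IQR {q1:.1f} to {q3:.1f})"


symmetric = pd.Series([48, 49, 50, 51, 52])
skewed = pd.Series([20, 22, 25, 27, 30, 500])

print(describe_for_report(symmetric))
print(describe_for_report(skewed))
```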

5.0 Visualization as a decision tool (not decoration)

5.1 Quick chart chooser (practical)

Compare categories → bar chart (sorted)
Trend over time → line chart
Relationship between two numeric variables → scatter plot
Distribution → histogram / boxplot
Two dimensions (category × category) → heatmap
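The first four rows of the chart chooser can be sketched with pandas plotting on top of matplotlib. The monthly data below is randomly generated for illustration only, and the column names (`month`, `category`, `marketing`, `sales`) are assumptions; the heatmap case is left out to keep the example short.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical monthly sales data for three categories
rng = np.random.default_rng(0)
months = pd.date_range("2024-01-01", periods=12, freq="MS")
df = pd.DataFrame({
    "month": np.tile(months, 3),
    "category": np.repeat(["Books", "Games", "Toys"], 12),
    "marketing": rng.uniform(1, 10, size=36),
})
df["sales"] = 50 + 8 * df["marketing"] + rng.normal(0, 5, size=36)

fig, axes = plt.subplots(2, 2, figsize=(10, 7))

# Compare categories: sorted bar chart
df.groupby("category")["sales"].sum().sort_values(ascending=False).plot.bar(ax=axes[0, 0])
# Trend over time: line chart
df.groupby("month")["sales"].sum().plot.line(ax=axes[0, 1])
# Relationship between two numeric variables: scatter plot
df.plot.scatter(x="marketing", y="sales", ax=axes[1, 0])
# Distribution: histogram
df["sales"].plot.hist(bins=10, ax=axes[1, 1])

fig.tight_layout()
# In Colab or Jupyter the figure renders at the end of the cell;
# in a plain script, call plt.show() here.
```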

6.0 Mini case: “Sales and marketing” dataset (EDA story)

Question: “Which product categories grew, and is growth associated with marketing spend?”

EDA steps:

Check data types (date as datetime, spend as numeric)
Summarize missing values (overall + by category)
Plot distribution of sales (is it skewed?)
Plot sales over time (overall + by category)
Scatter: sales vs marketing (colored by category)
Write 3 insights + 1 caveat
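The non-plotting parts of these steps can be sketched numerically as below. The dataset is a tiny invented stand-in, not the course dataset, and the column names (`date`, `category`, `sales`, `marketing_spend`) are assumptions. Year-over-year growth and a within-category correlation are used here as simple numeric counterparts to the time-series and scatter plots.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sales and marketing dataset
raw = pd.DataFrame({
    "date": ["2023-03-01", "2023-09-01", "2024-03-01", "2024-09-01"] * 2,
    "category": ["Books"] * 4 + ["Games"] * 4,
    "sales": [100, 112, 104, 115, 80, 95, 120, np.nan],
    "marketing_spend": ["10", "12", "11", "13", "8", "10", "15", "18"],
})

# Check data types: date as datetime, spend as numeric
df = raw.copy()
df["date"] = pd.to_datetime(df["date"])
df["marketing_spend"] = pd.to_numeric(df["marketing_spend"], errors="coerce")

# Summarize missing values: overall and by category
missing_overall = df.isna().sum()
missing_by_category = df["sales"].isna().groupby(df["category"]).sum()

# Sales over time: yearly totals per category and year-over-year growth
yearly = df.groupby(["category", df["date"].dt.year])["sales"].sum().unstack()
growth_pct = (yearly[2024] / yearly[2023] - 1) * 100

# Sales vs marketing: correlation within each category
corr_by_category = {
    name: group["sales"].corr(group["marketing_spend"])
    for name, group in df.groupby("category")
}

print(missing_by_category, growth_pct.round(1), corr_by_category, sep="\n\n")
```

In this toy data the missing Games value in 2024 lowers that year's total, so the computed growth for Games is misleading. That is exactly the kind of caveat the final step asks for. A positive correlation also describes association only, not a causal effect of marketing spend.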

7.0 Mini-lab (Google Colab)

In-class tasks (checkpoints)

Print data types and basic shape
Create a missingness summary table
Produce: (i) histogram/boxplot, (ii) line chart, (iii) bar chart by category
Submit 3 insights + 1 caveat in markdown inside the notebook

Submission

LINK

8.0 AI check (responsible use for EDA)

Good prompt examples

“Suggest 5 EDA checks for a dataset of customer transactions with dates, categories, and revenue.”
“Given these columns, what plots best answer ‘which category is growing’?”

Bad prompt example

“Write my EDA conclusions for me” (without your own verification and interpretation)

AI can help generate EDA ideas and code scaffolding, but interpretation and verification remain human responsibilities.

9.0 Review questions (quiz / reflection)

When would you report median and IQR instead of mean and SD?
Give one example of systematic missingness in a business dataset.
Which chart would you choose to show: (i) trend, (ii) distribution, (iii) group comparison—and why?

Please submit your reflection prompts for Lesson 1 and Lesson 2 before the next class: https://docs.google.com/forms/d/e/1FAIpQLSfsq2ln4ru9G1Mtt8Huezzl52RbNw0SFnXyeMj9g2wuudt8UQ/viewform?usp=publish-editor

In the next lesson, we will begin organizing prompts, reflections, and workflows into a lightweight “Research OS” for managing analytical projects.