Lesson 4 — Data Collection: Getting Data into Colab (Finance, WDI, OWID, News)

(choose-one-track lab + course-standard sanity checks + data source note)

Why this matters¶

Analytics and AI are only as good as the data pipeline behind them. A large part of professional work is:

finding the right dataset,
documenting what it is and where it came from,
and ensuring you can reproduce the same results later.

Three ways to get data (concept map)¶

Today’s lab format: choose ONE track¶

You do not need to run everything. Choose one source and produce the same outputs.

Track A: Yahoo Finance (market time series)
Track B: World Bank WDI (cross-country indicators)
Track C: Our World in Data (OWID Grapher CSV)
Track D: News feed API (GDELT, no API key)

Colab notebook (live)¶

Use the class notebook here:

https://drive.google.com/file/d/1oZC6dHbWaWoC1MiG4Bt-_vG_okcvbMHR/view?usp=sharing

Course-standard sanity checks (use after loading data)¶

These checks help you avoid common mistakes:

wrong types,
unexpected missingness,
wrong date range.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 50)

def sanity_checks(df, time_col=None, name="df"):
    print(f"--- {name} ---")
    print("shape:", df.shape)
    print("columns:", list(df.columns)[:12], "..." if len(df.columns) > 12 else "")
    print("\nmissingness (top 10):")
    miss = df.isna().mean().sort_values(ascending=False).head(10)
    print(miss)
    if time_col is not None and time_col in df.columns:
        print("\n time_col:", time_col)
        print(" min:", df[time_col].min(), " max:", df[time_col].max())
    print("--------------\n")

Track A — Yahoo Finance (yfinance)¶

Use case: market data time series (stocks/ETF/FX/crypto) Deliverable idea: plot close price; compute a simple summary table (monthly average close).

!pip -q install yfinance
import yfinance as yf

ticker = "AAPL"   # try: "MSFT", "TSLA", "SPY", "BTC-USD", "^GSPC"
start_date = "2023-01-01"
end_date   = "2024-12-31"

px = yf.download(ticker, start=start_date, end=end_date, progress=False).reset_index()
px.head()

sanity_checks(px, time_col="Date", name="prices")

Plot (required):

plt.figure()
plt.plot(px["Date"], px["Close"])
plt.title(f"{ticker} Close Price")
plt.xlabel("Date")
plt.ylabel("Close")
plt.show()

Table (required): monthly average close

px["month"] = px["Date"].dt.to_period("M").astype(str)
monthly = px.groupby("month")["Close"].mean().reset_index().rename(columns={"Close": "avg_close"})
monthly.head(10)

Track B — World Bank WDI (wbgapi)¶

Use case: cross-country indicators (GDP per capita, inflation, population, etc.) Deliverable idea: plot 3 countries over time + a simple “latest year” table.

!pip -q install wbgapi
import wbgapi as wb

indicator = "NY.GDP.PCAP.CD"  # try: "FP.CPI.TOTL.ZG" (inflation), "SP.POP.TOTL" (population)
countries = ["JPN", "THA", "USA"]  # ISO3 codes
years = range(2000, 2024)

wdi = wb.data.DataFrame(indicator, countries, time=years).reset_index()
wdi = wdi.rename(columns={"economy": "country_code", "time": "year", indicator: "value"})
wdi.head()

sanity_checks(wdi, time_col="year", name="wdi")

Plot (required):

plt.figure()
for c in countries:
    tmp = wdi[wdi["country_code"] == c].sort_values("year")
    plt.plot(tmp["year"], tmp["value"], label=c)

plt.title(f"WDI indicator: {indicator}")
plt.xlabel("Year")
plt.ylabel("Value")
plt.legend()
plt.show()

Table (required): latest year value

latest = wdi.sort_values("year").groupby("country_code").tail(1)[["country_code", "year", "value"]]
latest

Track C — Our World in Data (OWID Grapher CSV)¶

Use case: curated public datasets with consistent country/time structure Deliverable idea: plot 3 countries + “latest year” table.

url = "https://ourworldindata.org/grapher/co2-per-capita.csv"  # you can change the OWID slug
owid = pd.read_csv(url)
owid.head()

sanity_checks(owid, time_col="Year", name="owid")

countries = ["Japan", "Thailand", "United States"]  # adjust as needed
value_col = [c for c in owid.columns if c not in ["Entity", "Code", "Year"]][0]  # auto-pick
owid_sub = owid[owid["Entity"].isin(countries)].copy()

Plot (required):

plt.figure()
for c in countries:
    tmp = owid_sub[owid_sub["Entity"] == c].sort_values("Year")
    plt.plot(tmp["Year"], tmp[value_col], label=c)

plt.title(f"OWID: {value_col}")
plt.xlabel("Year")
plt.ylabel(value_col)
plt.legend()
plt.show()

Table (required): latest year by country

latest = (owid_sub.sort_values("Year")
          .groupby("Entity").tail(1)[["Entity", "Year", value_col]]
          .rename(columns={"Entity": "country"}))
latest

Track D — News feed API (GDELT, no key)¶

Use case: build a small headline dataset for later text-as-data/NLP work Deliverable idea: 50 recent articles on a topic + basic counts by language/country.

import requests

query = "inflation"  # try: "electric vehicles", "unemployment", "AI regulation"
url = (
    "https://api.gdeltproject.org/api/v2/doc/doc"
    f"?query={query}&mode=ArtList&format=json&maxrecords=50&sort=datedesc"
)

res = requests.get(url, timeout=30)
res.raise_for_status()
payload = res.json()

articles = pd.DataFrame(payload.get("articles", []))
news = articles[["seendate", "sourcecountry", "language", "title", "url"]].copy()
news.head(10)

sanity_checks(news, name="news")

Table (required):

news["language"].value_counts().head(10), news["sourceCountry"].value_counts().head(10)

Optional: save for later NLP

news.to_csv("gdelt_news.csv", index=False)
print("Saved gdelt_news.csv")

AI check (responsible use)¶

Data Source Note (required)¶

Fill this in for your chosen track.

Dataset name:
Source (organization / provider):
Access method (URL / package / API endpoint):
Date accessed:
Unit of observation: (e.g., country-year, day-price, article)
Variables used:
Filters applied: (countries, years, ticker, query terms, etc.)
Known limitations: (coverage, missingness, definitions, comparability, API stability)

Prompt & Workflow Log (only if you used AI tools)¶

If you used any AI assistance (e.g., ChatGPT) for code or writing, paste:

prompts used,
what you copied vs changed,
how you verified correctness (sanity checks, reruns, spot checks).

If you did not use AI tools, write: “No AI assistance used.”

Review questions (quiz / reflection)¶

When is an API better than downloading a CSV?
What must be included in a Data Source Note for reproducibility?
What is one limitation of cross-country comparisons using public datasets?
What is one bias risk when using news data as evidence?

Mini-lab (Google Colab)¶

In-class checkpoints (complete in the Colab notebook)¶

Choose ONE data source track in the notebook (OWID / WDI / yfinance / News-GDELT), then complete:

Load a dataset into a DataFrame.
Apply a reasonable filter (e.g., 3–5 countries + time range, ticker + time range, or query + number of articles).
Produce one plot and one summary table.
Write a Data Source Note and one limitation.

Reflection prompt (submit via LMS)¶

After completing the mini-lab in the Colab notebook, submit brief answers in the link below to the following:

Which data source track did you choose (OWID / WDI / yfinance / News-GDELT) and why?
What exactly did you analyze?
- OWID/WDI: dataset/indicator name (and code/slug if applicable), countries, time range
- yfinance: ticker(s), time range
- News-GDELT: query term(s), number of articles, time sorting/filters
What is the main pattern shown in your plot (1–2 sentences)?
What is one limitation of the dataset or your comparison (missingness, definitions, coverage, comparability, API bias, etc.)?
Paste your Data Source Note (source, access method, date accessed, variables used, filters applied, limitation).