Panel data (or longitudinal data) combines cross-sectional and time-series dimensions, offering powerful advantages for causal inference by allowing us to control for unobserved, time-invariant characteristics.
7.1 What is Panel Data?
Panel data consists of observations on the same \(n\) entities (individuals, firms, countries), each observed in \(T \geq 2\) time periods.
Notation: \((X_{it}, Y_{it})\), where \(i = 1, \ldots, n\) indexes entities and \(t = 1, \ldots, T\) indexes time periods.
Example: Data on the same 100 companies (\(i\)) over 10 years (\(t\)).
The major advantage of panel data is its ability to help resolve omitted variable bias, a primary source of endogeneity. If the omitted variable is constant over time for each entity, panel data methods can effectively control for it.
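To make the \((i, t)\) indexing concrete, here is a minimal sketch of a panel stored in "long" format, with one row per entity-period pair. The firm labels, years, and simulated values are purely hypothetical and serve only to show the layout.

```r
# Hypothetical toy panel: n = 3 entities observed over T = 4 periods
n_entities <- 3
n_periods <- 4
set.seed(1)

# One row per (firm, year) combination
panel <- expand.grid(year = 2001:2004, firm = c("A", "B", "C"))
panel <- panel[, c("firm", "year")]

# Simulated regressor and outcome (illustrative values only)
panel$x <- rnorm(nrow(panel))
panel$y <- 1 + 2 * panel$x + rnorm(nrow(panel))

# In a balanced panel every entity appears once in every period
table(panel$firm)
nrow(panel) == n_entities * n_periods
```

A panel in which every entity is observed in every period is called balanced; this is the case for the toy data above.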
7.2 The “Before and After” Comparison (First Differences)
A simple intuitive approach for two time periods (\(T=2\)) is to compare changes over time.
Suppose the “true” model includes an unobserved, time-invariant variable \(Z_i\) (e.g., managerial talent for a firm, innate ability for a person): \[
Y_{it} = c + \beta X_{it} + \gamma Z_i + u_{it}
\]
For time period \(t-1\), the model is: \[
Y_{it-1} = c + \beta X_{it-1} + \gamma Z_i + u_{it-1}
\]
Taking the difference between the two periods eliminates the time-invariant variable \(Z_i\): \[
Y_{it} - Y_{it-1} = \beta (X_{it} - X_{it-1}) + (u_{it} - u_{it-1})
\]
This First-Differenced (FD) model can be estimated by OLS. The key insight is that we do not need to observe \(Z_i\) to consistently estimate \(\beta\).
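The following simulation is a rough sketch of this insight, not an empirical result. It assumes a true slope of \(\beta = 2\), an unobserved \(Z_i\) that is correlated with \(X_{it}\), and two periods; all parameter values and variable names are made up for illustration.

```r
set.seed(42)
n <- 500        # number of entities (assumed)
beta <- 2       # true slope (assumed)
gamma <- 3      # effect of the unobserved Z_i (assumed)

z  <- rnorm(n)              # unobserved, time-invariant Z_i
x1 <- z + rnorm(n)          # X in period 1, correlated with Z_i
x2 <- z + rnorm(n)          # X in period 2, correlated with Z_i
y1 <- 1 + beta * x1 + gamma * z + rnorm(n)
y2 <- 1 + beta * x2 + gamma * z + rnorm(n)

# Cross-sectional OLS in period 2 omits Z_i, so its slope is biased
ols_slope <- coef(lm(y2 ~ x2))[["x2"]]

# First differences: Z_i and the intercept drop out, so no intercept is needed
dy <- y2 - y1
dx <- x2 - x1
fd_slope <- coef(lm(dy ~ 0 + dx))[["dx"]]

c(ols = ols_slope, fd = fd_slope)
```

Under these assumptions the cross-sectional slope should land well above 2, while the first-difference slope should be close to it, even though \(Z_i\) never enters the regression.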
7.3 Fixed Effects Regression
The Fixed Effects (FE) model generalizes the “before and after” idea to more than two time periods. It controls for all unobserved, time-invariant characteristics of each entity.
7.3.1 The Fixed Effects Model
The model allows each entity to have its own intercept: \[
Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + \alpha_i + \epsilon_{it}
\]
where \(\alpha_i\) is the entity-specific fixed effect.
Here the error term is decomposed into two parts: \(\alpha_i\), which captures all unobserved variables that are constant over time for entity \(i\), and \(\epsilon_{it}\), the usual idiosyncratic stochastic term.
7.3.2 Estimation Methods
There are two common ways to estimate the Fixed Effects model:
7.3.2.1 1. Least Squares Dummy Variable (LSDV) Regression
The model can be written with a common intercept and a dummy variable for each entity except one (one is omitted to avoid perfect multicollinearity, a.k.a. the “dummy variable trap”): \[
Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + \delta_1 D_{1,i} + \delta_2 D_{2, i} + \ldots + \delta_{n-1} D_{n-1,i} + \epsilon_{it}
\] where \(D_{1,i}\) is a dummy variable equal to 1 for the first entity and 0 otherwise, and so on, up to \(D_{n-1,i}\).
Disadvantage: With a large number of entities (\(n\)), LSDV must estimate \(n-1\) dummy coefficients explicitly, which uses up many degrees of freedom and becomes computationally cumbersome.
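As a small sketch of LSDV in base R, the simulated example below lets `factor(id)` generate the entity dummies. The data-generating values (true slope 0.5, 50 entities, 6 periods) are assumptions chosen only for illustration.

```r
set.seed(1)
n <- 50           # entities (assumed)
n_periods <- 6    # time periods per entity (assumed)
id <- rep(1:n, each = n_periods)

alpha <- rnorm(n, sd = 2)                # entity-specific effects
x <- alpha[id] + rnorm(n * n_periods)    # regressor correlated with the entity effect
y <- 1 + 0.5 * x + alpha[id] + rnorm(n * n_periods)

# LSDV: factor(id) expands into n - 1 entity dummies next to the common intercept
lsdv <- lm(y ~ x + factor(id))

coef(lsdv)[["x"]]     # slope estimate
length(coef(lsdv))    # intercept + slope + (n - 1) dummy coefficients
```

The coefficient vector grows with \(n\), which is exactly why this approach becomes unwieldy for panels with many entities.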
7.3.2.2 2. The “Entity-Demeaned” OLS Algorithm (Preferred)
This is the computationally more efficient method, which involves subtracting the entity-specific mean from each variable.
Calculate the entity-specific averages: \(\bar{Y}_i = \frac{1}{T}\sum_{t=1}^T Y_{it}\) and \(\bar{X}_i = \frac{1}{T}\sum_{t=1}^T X_{it}\).
Demean the data: \(Y_{it} - \bar{Y}_i\) and \(X_{it} - \bar{X}_i\).
Run an OLS regression on the transformed (demeaned) variables: \[
(Y_{it} - \bar{Y}_i) = \beta_1 (X_{1,it} - \bar{X}_{1,i}) + \ldots + \beta_k (X_{k,it} - \bar{X}_{k,i}) + (\epsilon_{it} - \bar{\epsilon}_i)
\]
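This process, known as the within transformation, effectively sweeps out the fixed effect \(\alpha_i\) (along with the common intercept \(c\)): both are constant over time, so each equals its own entity mean and vanishes when that mean is subtracted.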
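The three steps can be sketched by hand in base R. The example below uses simulated data with assumed parameter values and checks that the demeaned regression returns the same slope as a dummy-variable regression on the same data.

```r
set.seed(2)
n <- 40           # entities (assumed)
n_periods <- 5    # time periods per entity (assumed)
id <- rep(1:n, each = n_periods)

alpha <- rnorm(n, sd = 2)
x <- alpha[id] + rnorm(n * n_periods)
y <- 1 + 0.5 * x + alpha[id] + rnorm(n * n_periods)

# Steps 1-2: subtract each entity's own mean
# (ave() returns the group mean aligned with every row)
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)

# Step 3: OLS on the demeaned data; the intercept has been swept out as well
within_slope <- coef(lm(y_dm ~ 0 + x_dm))[["x_dm"]]

# Dummy-variable regression on the untransformed data, for comparison
lsdv_slope <- coef(lm(y ~ x + factor(id)))[["x"]]

all.equal(within_slope, lsdv_slope)
```

One caveat: the standard errors that `lm()` reports for the demeaned regression do not account for the \(n\) estimated entity means, so dedicated panel routines adjust the degrees of freedom.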
7.3.3 Time Fixed Effects
In addition to entity-specific effects, there might be time-specific effects that affect all entities in a given time period (e.g., a common economic shock in a particular year).
A model with both entity and time fixed effects is: \[
Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + \alpha_i + \lambda_t + \epsilon_{it}
\] where \(\lambda_t\) is the time fixed effect. This model can be estimated in several ways: by including both entity dummies and time-period dummies (omitting one of each), by demeaning with respect to both entities and time, or by entity-demeaning the data and including time-period dummies (less one).
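As a sketch, two of these routes can be compared on a simulated balanced panel. All parameter values below are assumptions, and the equality check relies on the panel being balanced.

```r
set.seed(3)
n <- 30           # entities (assumed)
n_periods <- 8    # time periods (assumed)
id <- rep(1:n, each = n_periods)
time <- rep(1:n_periods, times = n)

alpha <- rnorm(n, sd = 2)             # entity effects
lambda <- rnorm(n_periods, sd = 2)    # time effects (common shocks)
x <- alpha[id] + lambda[time] + rnorm(n * n_periods)
y <- 1 + 0.5 * x + alpha[id] + lambda[time] + rnorm(n * n_periods)

# Route (a): entity dummies and time dummies
slope_dummies <- coef(lm(y ~ x + factor(id) + factor(time)))[["x"]]

# Route (b): entity-demean first, then add time dummies
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
slope_mixed <- coef(lm(y_dm ~ x_dm + factor(time)))[["x_dm"]]

all.equal(slope_dummies, slope_mixed)
```

Route (b) estimates far fewer coefficients, which matters when \(n\) is large and \(T\) is small.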
7.3.4 Testing for Fixed Effects
You can test the joint significance of the entity fixed effects using an F-test that compares the Fixed Effects model to a simple OLS model (pooled regression) with a single constant term.
\(H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_n\) (No entity-specific effects; pooled OLS is fine).
\(H_1\): The \(\alpha_i\) are not all equal (Fixed Effects model is appropriate).
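One way to sketch this test is as a standard F-test for nested models, \(F = \frac{(RSS_{pooled} - RSS_{FE})/(n-1)}{RSS_{FE}/(nT - n - k)}\), where \(RSS\) denotes the residual sum of squares. The example below computes it by hand on simulated data (parameter values are assumptions) and compares the result with base R's `anova()`.

```r
set.seed(4)
n <- 20           # entities (assumed)
n_periods <- 10   # time periods (assumed)
k <- 1            # number of slope coefficients
id <- rep(1:n, each = n_periods)

alpha <- rnorm(n, sd = 2)    # entity effects that differ across entities
x <- rnorm(n * n_periods)
y <- 1 + 0.5 * x + alpha[id] + rnorm(n * n_periods)

pooled <- lm(y ~ x)                 # restricted model: single intercept
lsdv <- lm(y ~ x + factor(id))      # unrestricted model: entity dummies

rss_pooled <- sum(resid(pooled)^2)
rss_fe <- sum(resid(lsdv)^2)
df1 <- n - 1
df2 <- n * n_periods - n - k

f_stat <- ((rss_pooled - rss_fe) / df1) / (rss_fe / df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# anova() carries out the same nested-model comparison
f_table <- anova(pooled, lsdv)
c(F_by_hand = f_stat, F_anova = f_table$F[2], p = p_value)
```

Because the simulated entity effects really do differ, the test should reject \(H_0\) here.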
7.4 Random Effects Model
The Random Effects (RE) model is an alternative estimator that is appropriate when the unobserved entity-specific effect is uncorrelated with all the explanatory variables.
7.4.1 The Random Effects Model
The model treats the entity-specific intercept as a random variable: \[
Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + u_{it}
\]
where, as before, the composite error term can be expressed as \(u_{it} = \alpha_i + \epsilon_{it}\).
Here the key assumption is that \(\alpha_i\) (the random effect) is uncorrelated with the \(X\)'s, i.e., \(Cov(\alpha_i, X_{it}) = 0\).
7.4.2 Fixed Effects vs. Random Effects
Fixed Effects (FE): Use when the unobserved entity-specific effect \(\alpha_i\) is likely to be correlated with the explanatory variables \(X_{it}\). FE should mitigate endogeneity issues as it eliminates this source of bias.
Random Effects (RE): Use when \(\alpha_i\) is uncorrelated with the \(X_{it}\). RE is more efficient (provides smaller standard errors) than FE if this assumption holds.
The Hausman test is often used to decide between FE and RE models.
\(H_0\): The RE assumption is valid (\(Cov(\alpha_i, X_{it}) = 0\)).
\(H_1\): \(H_0\) is false. The FE model is consistent.
And as usual, a low p-value leads us to reject \(H_0\), which indicates that the Fixed Effects model is preferred.
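For reference, the textbook form of the Hausman statistic compares the two sets of slope estimates directly. The hat notation below is introduced here for illustration and is not used elsewhere in this chapter: \[
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ \widehat{Var}(\hat{\beta}_{FE}) - \widehat{Var}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})
\] Under \(H_0\), \(H\) is asymptotically \(\chi^2\)-distributed with degrees of freedom equal to the number of coefficients being compared. Intuitively, if both estimators are consistent they should give similar answers, so a large gap between them, relative to its sampling variability, is evidence against the RE assumption.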
7.5 Implementation in R
# Load necessary packages
# install.packages("plm")
library(plm) # Panel data econometrics in R

# Use the Grunfeld dataset (a classic panel dataset included in the plm package)
# This dataset contains investment data for 10 US firms from 1935-1954
data("Grunfeld", package = "plm")

# Look at the structure of the data
head(Grunfeld)

# Note: firm = company identifier, year = time identifier
# inv = investment, value = value of the firm, capital = stock of plant and equipment

# Create a panel data frame. 'index' specifies the entity and time identifiers.
p.data <- pdata.frame(Grunfeld, index = c("firm", "year"))

# 1. Pooled OLS (ignoring panel structure)
pooled_model <- plm(inv ~ value + capital, data = p.data, model = "pooling")
summary(pooled_model)
Pooling Model
Call:
plm(formula = inv ~ value + capital, data = p.data, model = "pooling")
Balanced Panel: n = 10, T = 20, N = 200
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-291.6757 -30.0137 5.3033 34.8293 369.4464
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -42.7143694 9.5116760 -4.4907 1.207e-05 ***
value 0.1155622 0.0058357 19.8026 < 2.2e-16 ***
capital 0.2306785 0.0254758 9.0548 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 9359900
Residual Sum of Squares: 1755900
R-Squared: 0.81241
Adj. R-Squared: 0.8105
F-statistic: 426.576 on 2 and 197 DF, p-value: < 2.22e-16
# 2. Fixed Effects (Entity Demeaned)
fe_model <- plm(inv ~ value + capital, data = p.data, model = "within")
summary(fe_model)
Oneway (individual) effect Within Model
Call:
plm(formula = inv ~ value + capital, data = p.data, model = "within")
Balanced Panel: n = 10, T = 20, N = 200
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-184.00857 -17.64316 0.56337 19.19222 250.70974
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
value 0.110124 0.011857 9.2879 < 2.2e-16 ***
capital 0.310065 0.017355 17.8666 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2244400
Residual Sum of Squares: 523480
R-Squared: 0.76676
Adj. R-Squared: 0.75311
F-statistic: 309.014 on 2 and 188 DF, p-value: < 2.22e-16
# 3. Random Effects
re_model <- plm(inv ~ value + capital, data = p.data, model = "random")
summary(re_model)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = inv ~ value + capital, data = p.data, model = "random")
Balanced Panel: n = 10, T = 20, N = 200
Effects:
var std.dev share
idiosyncratic 2784.46 52.77 0.282
individual 7089.80 84.20 0.718
theta: 0.8612
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-177.6063 -19.7350 4.6851 19.5105 252.8743
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) -57.834415 28.898935 -2.0013 0.04536 *
value 0.109781 0.010493 10.4627 < 2e-16 ***
capital 0.308113 0.017180 17.9339 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2381400
Residual Sum of Squares: 548900
R-Squared: 0.7695
Adj. R-Squared: 0.76716
Chisq: 657.674 on 2 DF, p-value: < 2.22e-16
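The `theta` in this output can be read as the fraction of each entity mean that the RE estimator subtracts from the data (a "quasi-demeaning"). A commonly used formula for a balanced panel is \(\theta = 1 - \sqrt{\sigma^2_\epsilon / (\sigma^2_\epsilon + T \sigma^2_\alpha)}\). The short check below plugs in the variance components printed in the summary; the variable names are ours.

```r
sigma2_idios <- 2784.46   # "idiosyncratic" variance from the summary above
sigma2_indiv <- 7089.80   # "individual" variance from the summary above
n_periods <- 20           # T in the balanced panel

# Quasi-demeaning weight for a balanced panel
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + n_periods * sigma2_indiv))
round(theta, 4)
```

This should reproduce the reported value of roughly 0.86. A \(\theta\) close to 1 means the RE transformation is close to full demeaning, which is consistent with the RE coefficients here being very similar to the FE ones.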
# 4. Hausman Test to choose between FE and RE
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)
Hausman Test
data: inv ~ value + capital
chisq = 2.3304, df = 2, p-value = 0.3119
alternative hypothesis: one model is inconsistent
With a p-value of 0.3119, we fail to reject \(H_0\) at conventional significance levels, so the test provides no evidence against the Random Effects model for these data.
# 5. Fixed Effects with both Entity and Time Effects
fe_twoway_model <- plm(inv ~ value + capital, data = p.data, model = "within", effect = "twoways")
summary(fe_twoway_model)
Twoways effects Within Model
Call:
plm(formula = inv ~ value + capital, data = p.data, effect = "twoways",
model = "within")
Balanced Panel: n = 10, T = 20, N = 200
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-162.6094 -19.4710 -1.2669 19.1277 211.8420
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
value 0.117716 0.013751 8.5604 6.653e-15 ***
capital 0.357916 0.022719 15.7540 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 1615600
Residual Sum of Squares: 452150
R-Squared: 0.72015
Adj. R-Squared: 0.67047
F-statistic: 217.442 on 2 and 169 DF, p-value: < 2.22e-16
# 6. First Differences model (for T=2 periods)
# Let's create a subset with just 2 years to demonstrate
Grunfeld_2years <- Grunfeld[Grunfeld$year %in% c(1935, 1936), ]
p.data_2years <- pdata.frame(Grunfeld_2years, index = c("firm", "year"))

# Note: the summary below reports an intercept; in a differenced model it
# captures a change between the two periods that is common to all firms
fd_model <- plm(inv ~ value + capital, data = p.data_2years, model = "fd")
summary(fd_model)
Oneway (individual) effect First-Difference Model
Call:
plm(formula = inv ~ value + capital, data = p.data_2years, model = "fd")
Balanced Panel: n = 10, T = 2, N = 20
Observations used in estimation: 10
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-51.5451 -17.3865 -6.3621 7.1823 96.6838
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 17.901125 18.516740 0.9668 0.3659
value 0.061786 0.034264 1.8032 0.1143
capital -1.011766 1.065441 -0.9496 0.3739
Total Sum of Squares: 19847
Residual Sum of Squares: 13551
R-Squared: 0.31722
Adj. R-Squared: 0.12214
F-statistic: 1.62613 on 2 and 7 DF, p-value: 0.26301