7  Panel Data Regression

Panel data (or longitudinal data) combines cross-sectional and time-series dimensions, offering powerful advantages for causal inference by allowing us to control for unobserved, time-invariant characteristics.

7.1 What is Panel Data?

Panel data consist of observations on the same \(n\) entities (individuals, firms, countries), each observed in \(T \geq 2\) time periods.

  • Notation: \((X_{it}, Y_{it})\), where \(i = 1, \ldots, n\) and \(t = 1, \ldots, T\).
  • Example: Data on the same 100 companies (\(i\)) over 10 years (\(t\)).
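To make the notation concrete, here is a minimal sketch of a panel stored in "long" format, with one row per \((i, t)\) pair. The toy data, the column names (`firm`, `year`, `x`, `y`) and the sizes are assumptions chosen only for illustration.

```r
# Hypothetical toy panel: n = 3 firms observed over T = 4 years (long format)
n <- 3
T_periods <- 4
panel <- expand.grid(year = 2001:(2000 + T_periods), firm = 1:n)
panel <- panel[, c("firm", "year")]          # rows sorted by firm, then year

set.seed(1)
panel$x <- rnorm(nrow(panel))                # X_it
panel$y <- 1 + 2 * panel$x + rnorm(nrow(panel))  # Y_it

# Each (i, t) combination should appear at most once
no_duplicates <- !any(duplicated(panel[, c("firm", "year")]))

# A panel is balanced when every entity is observed in every period
obs_per_firm <- table(panel$firm)
is_balanced <- all(obs_per_firm == T_periods)

head(panel)
```

A balanced panel like this one has \(n \times T\) rows in total; real data sets often have gaps and are then called unbalanced.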

The major advantage of panel data is its ability to help resolve omitted variable bias, a primary source of endogeneity. If the omitted variable is constant over time for each entity, panel data methods can effectively control for it.

7.2 The “Before and After” Comparison (First Differences)

A simple intuitive approach for two time periods (\(T=2\)) is to compare changes over time.

Suppose the “true” model includes an unobserved, time-invariant variable \(Z_i\) (e.g., managerial talent for a firm, innate ability for a person): \[ Y_{it} = c + \beta X_{it} + \gamma Z_i + u_{it} \]

For time period \(t-1\), the model is: \[ Y_{i,t-1} = c + \beta X_{i,t-1} + \gamma Z_i + u_{i,t-1} \]

Taking the difference between the two periods eliminates the time-invariant variable \(Z_i\) (and the constant \(c\)): \[ Y_{it} - Y_{i,t-1} = \beta (X_{it} - X_{i,t-1}) + (u_{it} - u_{i,t-1}) \]

This First-Differenced (FD) model can be estimated by OLS. The key insight is that we do not need to observe \(Z_i\) to consistently estimate \(\beta\).
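A small simulation can illustrate the point. The sketch below uses made-up data with \(\beta = 2\) and \(\gamma = 3\) (all numbers are assumptions for illustration) in which \(X_{it}\) is correlated with the unobserved \(Z_i\). A cross-sectional regression that omits \(Z_i\) should be biased, while the first-differenced regression should recover a value close to \(\beta\).

```r
set.seed(42)
n <- 500
beta <- 2
gamma <- 3

z  <- rnorm(n)                 # unobserved, time-invariant Z_i
x1 <- 0.8 * z + rnorm(n)       # X in period t-1, correlated with Z_i
x2 <- 0.8 * z + rnorm(n)       # X in period t, correlated with Z_i
y1 <- 1 + beta * x1 + gamma * z + rnorm(n)
y2 <- 1 + beta * x2 + gamma * z + rnorm(n)

# Cross-sectional OLS in period t: Z_i is omitted, so the slope is biased
b_naive <- coef(lm(y2 ~ x2))["x2"]

# First differences: Z_i and the constant drop out, so no intercept is included
dy <- y2 - y1
dx <- x2 - x1
b_fd <- coef(lm(dy ~ 0 + dx))["dx"]

c(naive = unname(b_naive), first_difference = unname(b_fd))
```

Note that \(Z_i\) is never passed to either regression; differencing removes it without our having to observe it.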

7.3 Fixed Effects Regression

The Fixed Effects (FE) model generalizes the “before and after” idea to more than two time periods. It controls for all unobserved, time-invariant characteristics of each entity.

7.3.1 The Fixed Effects Model

The model allows each entity to have its own intercept: \[ Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + \alpha_i + \epsilon_{it} \]

where \(\alpha_i\) is the entity-specific fixed effect.

Here the error term is decomposed into two parts: \(\alpha_i\), which captures all unobserved variables that are constant over time for entity \(i\), and \(\epsilon_{it}\), the usual stochastic error term.

7.3.2 Estimation Methods

There are two common ways to estimate the Fixed Effects model:

7.3.2.1 1. Least Squares Dummy Variable (LSDV) Regression

The model can be written with a common intercept and dummy variables for each entity (except one, to avoid perfect multicollinearity, also known as the “dummy variable trap”): \[ Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + \delta_1 D_{1,i} + \delta_2 D_{2, i} + \ldots + \delta_{n-1} D_{n-1,i} + \epsilon_{it} \] where \(D_{1,i}\) is a dummy variable equal to 1 for the first entity and 0 otherwise, and so on, up to entity \(n-1\). The dummies take the place of \(\alpha_i\), so the remaining error is \(\epsilon_{it}\).

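  • Disadvantage: With a large number of entities (\(n\)), explicitly estimating \(n-1\) dummy coefficients is computationally costly and clutters the output. The \(n-1\) degrees of freedom are used up whichever way the fixed effects are removed.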

7.3.2.2 2. The “Entity-Demeaned” OLS Algorithm (Preferred)

This is the computationally more efficient method, which involves subtracting the entity-specific mean from each variable.

  1. Calculate the entity-specific averages: \(\bar{Y}_i = \frac{1}{T}\sum_{t=1}^T Y_{it}\) and \(\bar{X}_i = \frac{1}{T}\sum_{t=1}^T X_{it}\).
  2. Demean the data: \(Y_{it} - \bar{Y}_i\) and \(X_{it} - \bar{X}_i\).
  3. Run an OLS regression on the transformed (demeaned) variables: \[ (Y_{it} - \bar{Y}_i) = \beta_1 (X_{1,it} - \bar{X}_{1,i}) + \ldots + \beta_k (X_{k,it} - \bar{X}_{k,i}) + (\epsilon_{it} - \bar{\epsilon}_i) \]

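This process, known as the within transformation, effectively sweeps out the fixed effect \(\alpha_i\): because \(\alpha_i\) is constant over time, it equals its own entity mean and cancels when the mean is subtracted.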
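The two estimation methods should give the same slope estimates. The following sketch checks this on simulated data using only base R; the data-generating process (true slope of 2, regressor correlated with \(\alpha_i\)) is an assumption made for illustration.

```r
set.seed(123)
n <- 200
T_periods <- 5
id <- rep(1:n, each = T_periods)            # entity identifier i

alpha <- rnorm(n)[id]                       # entity effect alpha_i, constant over t
x <- 0.5 * alpha + rnorm(n * T_periods)     # regressor correlated with alpha_i
y <- 1 + 2 * x + alpha + rnorm(n * T_periods)

# Pooled OLS ignores alpha_i and is biased here
b_pooled <- coef(lm(y ~ x))["x"]

# LSDV: factor(id) creates entity dummies (R drops one level automatically)
b_lsdv <- coef(lm(y ~ x + factor(id)))["x"]

# Within transformation: subtract entity means (ave() returns group means)
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
b_within <- coef(lm(y_dm ~ 0 + x_dm))["x_dm"]

c(pooled = unname(b_pooled), lsdv = unname(b_lsdv), within = unname(b_within))
```

The LSDV and within slopes agree up to numerical precision. One caveat: the standard errors printed by `lm()` for the demeaned regression use \(nT - k\) degrees of freedom rather than \(nT - n - k\), so they are slightly too small unless corrected. Dedicated panel routines handle this adjustment.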

7.3.3 Time Fixed Effects

In addition to entity-specific effects, there might be time-specific effects that affect all entities in a given time period (e.g., a common economic shock in a particular year).

A model with both entity and time fixed effects is: \[ Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + \alpha_i + \lambda_t + \epsilon_{it} \] where \(\lambda_t\) is the time fixed effect. This model can be estimated by including entity dummies and time-period dummies, by demeaning with respect to both entities and time, or by entity-demeaning the variables and including time-period dummies (less one).
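The three estimation routes can be compared on simulated data. This is a minimal base-R sketch that assumes a balanced panel; the simple double-demeaning formula used here (subtract entity means and time means, then add back the overall mean) is only exact in that case. All data and parameter values are made up for illustration.

```r
set.seed(7)
n <- 30
T_periods <- 6
id <- rep(1:n, each = T_periods)            # entity identifier i
tt <- rep(1:T_periods, times = n)           # time identifier t

alpha  <- rnorm(n)[id]                      # entity effects alpha_i
lambda <- rnorm(T_periods)[tt]              # time effects lambda_t
x <- 0.5 * alpha + 0.5 * lambda + rnorm(n * T_periods)
y <- 1 + 2 * x + alpha + lambda + rnorm(n * T_periods)

# (a) Entity dummies and time dummies
b_dummies <- coef(lm(y ~ x + factor(id) + factor(tt)))["x"]

# (b) Demean with respect to both entity and time (balanced panel only)
double_demean <- function(v) v - ave(v, id) - ave(v, tt) + mean(v)
y_dd <- double_demean(y)
x_dd <- double_demean(x)
b_double <- coef(lm(y_dd ~ 0 + x_dd))["x_dd"]

# (c) Entity-demean, then include time dummies
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
b_mixed <- coef(lm(y_dm ~ x_dm + factor(tt)))["x_dm"]

c(dummies = unname(b_dummies), double = unname(b_double), mixed = unname(b_mixed))
```

With an unbalanced panel, routes (b) and (c) as written would generally no longer match route (a), so the dummy-variable form or a dedicated panel routine is the safer choice there.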

7.3.4 Testing for Fixed Effects

You can test the joint significance of the entity fixed effects using an F-test that compares the Fixed Effects model to a simple OLS model (pooled regression) with a single constant term.

\(H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_n\) (No entity-specific effects; pooled OLS is fine).

\(H_1\): The \(\alpha_i\) are not all equal (Fixed Effects model is appropriate).
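The test statistic compares the residual sum of squares (RSS) of the restricted pooled model with that of the unrestricted FE model: \[ F = \frac{(RSS_{pooled} - RSS_{FE})/(n-1)}{RSS_{FE}/(nT - n - k)} \] As a rough worked example, the sketch below plugs in the rounded RSS values reported in the pooled and within summaries for the Grunfeld data in Section 7.5 (\(n = 10\), \(T = 20\), \(k = 2\)). Because the inputs are rounded, the result is approximate.

```r
# Rounded values copied from the summary() output in Section 7.5
rss_pooled <- 1755900    # restricted model: pooled OLS
rss_fe     <- 523480     # unrestricted model: fixed effects (within)
n <- 10                  # entities
T_periods <- 20          # time periods
k <- 2                   # slope coefficients (value, capital)

df1 <- n - 1                     # number of restrictions
df2 <- n * T_periods - n - k     # residual degrees of freedom of the FE model

F_stat  <- ((rss_pooled - rss_fe) / df1) / (rss_fe / df2)
p_value <- pf(F_stat, df1, df2, lower.tail = FALSE)

c(F = F_stat, df1 = df1, df2 = df2, p_value = p_value)
```

The statistic comes out at roughly 49 on 9 and 188 degrees of freedom, so the null of no entity-specific effects is clearly rejected for these data. In the plm package, `pFtest(fe_model, pooled_model)` is intended to carry out this comparison directly.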

7.4 Random Effects Model

The Random Effects (RE) model is an alternative estimator when the unobserved entity-specific effect is uncorrelated with all the explanatory variables.

7.4.1 The Random Effects Model

The model treats the entity-specific intercept as a random variable: \[ Y_{it} = c + \beta_1 X_{1,it} + \ldots + \beta_k X_{k,it} + u_{it} \]

where, as before, the composite error term can be expressed as \(u_{it} = \alpha_i + \epsilon_{it}\).

Here the key assumption is that \(\alpha_i\) (the random effect) is uncorrelated with the regressors, i.e., \(Cov(\alpha_i, X_{it}) = 0\).

7.4.2 Fixed Effects vs. Random Effects

  • Fixed Effects (FE): Use when the unobserved entity-specific effect \(\alpha_i\) is likely to be correlated with the explanatory variables \(X_{it}\). FE should mitigate endogeneity issues as it eliminates this source of bias.
  • Random Effects (RE): Use when \(\alpha_i\) is uncorrelated with the \(X_{it}\). RE is more efficient (provides smaller standard errors) than FE if this assumption holds.

The Hausman test is often used to decide between FE and RE models.

\(H_0\): The RE assumption is valid (\(Cov(\alpha_i, X_{it}) = 0\)).

\(H_1\): \(H_0\) is false, so only the FE estimator is consistent.

And, as usual, a low p-value leads us to reject \(H_0\), indicating that the Fixed Effects model is preferred.
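The Hausman statistic measures how far apart the FE and RE slope estimates are, relative to the difference in their estimated covariance matrices: \[ H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ \widehat{Var}(\hat{\beta}_{FE}) - \widehat{Var}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \] Under \(H_0\) it is approximately chi-squared with \(k\) degrees of freedom. The sketch below computes it by hand for the Grunfeld data used in Section 7.5 and assumes the plm package is installed. The object names `fe`, `re`, `b_diff` and `v_diff` are illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")

# Compare only the coefficients both models share (FE has no intercept)
common <- names(coef(fe))
b_diff <- coef(fe) - coef(re)[common]
v_diff <- vcov(fe) - vcov(re)[common, common]

# Quadratic form and chi-squared p-value with k = length(b_diff) df
H <- as.numeric(t(b_diff) %*% solve(v_diff) %*% b_diff)
p_value <- pchisq(H, df = length(b_diff), lower.tail = FALSE)

c(H = H, p_value = p_value)
```

In practice `phtest(fe, re)` does this for you; the manual version is only meant to show what goes into the statistic.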

7.5 Implementation in R

# Load necessary packages
# install.packages("plm")
library(plm) # Panel data econometrics in R

# Use the Grunfeld dataset (a classic panel dataset included in the plm package)
# This dataset contains investment data for 10 US firms from 1935-1954
data("Grunfeld", package = "plm")

# Look at the structure of the data
head(Grunfeld)
  firm year   inv  value capital
1    1 1935 317.6 3078.5     2.8
2    1 1936 391.8 4661.7    52.6
3    1 1937 410.6 5387.1   156.9
4    1 1938 257.7 2792.2   209.2
5    1 1939 330.8 4313.2   203.4
6    1 1940 461.2 4643.9   207.2
# Note: firm = company identifier, year = time identifier
#       inv = investment, value = value of the firm, capital = stock of plant and equipment

# Create a panel data frame. 'index' specifies the entity and time identifiers.
p.data <- pdata.frame(Grunfeld, index = c("firm", "year"))

# 1. Pooled OLS (ignoring panel structure)
pooled_model <- plm(inv ~ value + capital, data = p.data, model = "pooling")
summary(pooled_model)
Pooling Model

Call:
plm(formula = inv ~ value + capital, data = p.data, model = "pooling")

Balanced Panel: n = 10, T = 20, N = 200

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-291.6757  -30.0137    5.3033   34.8293  369.4464 

Coefficients:
               Estimate  Std. Error t-value  Pr(>|t|)    
(Intercept) -42.7143694   9.5116760 -4.4907 1.207e-05 ***
value         0.1155622   0.0058357 19.8026 < 2.2e-16 ***
capital       0.2306785   0.0254758  9.0548 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    9359900
Residual Sum of Squares: 1755900
R-Squared:      0.81241
Adj. R-Squared: 0.8105
F-statistic: 426.576 on 2 and 197 DF, p-value: < 2.22e-16
# 2. Fixed Effects (Entity Demeaned)
fe_model <- plm(inv ~ value + capital, data = p.data, model = "within")
summary(fe_model)
Oneway (individual) effect Within Model

Call:
plm(formula = inv ~ value + capital, data = p.data, model = "within")

Balanced Panel: n = 10, T = 20, N = 200

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-184.00857  -17.64316    0.56337   19.19222  250.70974 

Coefficients:
        Estimate Std. Error t-value  Pr(>|t|)    
value   0.110124   0.011857  9.2879 < 2.2e-16 ***
capital 0.310065   0.017355 17.8666 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    2244400
Residual Sum of Squares: 523480
R-Squared:      0.76676
Adj. R-Squared: 0.75311
F-statistic: 309.014 on 2 and 188 DF, p-value: < 2.22e-16
# 3. Random Effects
re_model <- plm(inv ~ value + capital, data = p.data, model = "random")
summary(re_model)
Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = inv ~ value + capital, data = p.data, model = "random")

Balanced Panel: n = 10, T = 20, N = 200

Effects:
                  var std.dev share
idiosyncratic 2784.46   52.77 0.282
individual    7089.80   84.20 0.718
theta: 0.8612

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-177.6063  -19.7350    4.6851   19.5105  252.8743 

Coefficients:
              Estimate Std. Error z-value Pr(>|z|)    
(Intercept) -57.834415  28.898935 -2.0013  0.04536 *  
value         0.109781   0.010493 10.4627  < 2e-16 ***
capital       0.308113   0.017180 17.9339  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    2381400
Residual Sum of Squares: 548900
R-Squared:      0.7695
Adj. R-Squared: 0.76716
Chisq: 657.674 on 2 DF, p-value: < 2.22e-16
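The `theta` reported in the Random Effects output can be reproduced from the two variance components. RE is estimated on quasi-demeaned data, \(Y_{it} - \theta \bar{Y}_i\), where for a balanced panel \[ \theta = 1 - \sqrt{\frac{\sigma^2_{\epsilon}}{\sigma^2_{\epsilon} + T \sigma^2_{\alpha}}} \] A value of \(\theta = 0\) corresponds to pooled OLS and \(\theta = 1\) to Fixed Effects. This small check plugs in the rounded variances printed above, so the result is approximate; the variable names are illustrative.

```r
# Variance components copied from the "Effects" block of the RE summary
sigma2_eps   <- 2784.46   # idiosyncratic variance
sigma2_alpha <- 7089.80   # individual (entity) variance
T_periods    <- 20        # time periods per firm (balanced panel)

theta <- 1 - sqrt(sigma2_eps / (sigma2_eps + T_periods * sigma2_alpha))
round(theta, 4)
```

The result is close to 1, which helps explain why the RE coefficients in this example are so similar to the FE coefficients.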
# 4. Hausman Test to choose between FE and RE
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)

    Hausman Test

data:  inv ~ value + capital
chisq = 2.3304, df = 2, p-value = 0.3119
alternative hypothesis: one model is inconsistent
# Note: p-value = 0.31 > 0.05, so we fail to reject H0; the RE assumption is not rejected for these data
# 5. Fixed Effects with both Entity and Time Effects
fe_twoway_model <- plm(inv ~ value + capital, data = p.data, model = "within", effect = "twoways")
summary(fe_twoway_model)
Twoways effects Within Model

Call:
plm(formula = inv ~ value + capital, data = p.data, effect = "twoways", 
    model = "within")

Balanced Panel: n = 10, T = 20, N = 200

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-162.6094  -19.4710   -1.2669   19.1277  211.8420 

Coefficients:
        Estimate Std. Error t-value  Pr(>|t|)    
value   0.117716   0.013751  8.5604 6.653e-15 ***
capital 0.357916   0.022719 15.7540 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    1615600
Residual Sum of Squares: 452150
R-Squared:      0.72015
Adj. R-Squared: 0.67047
F-statistic: 217.442 on 2 and 169 DF, p-value: < 2.22e-16
# 6. First Differences model (for T=2 periods)
# Let's create a subset with just 2 years to demonstrate
Grunfeld_2years <- Grunfeld[Grunfeld$year %in% c(1935, 1936), ]
p.data_2years <- pdata.frame(Grunfeld_2years, index = c("firm", "year"))
fd_model <- plm(inv ~ value + capital, data = p.data_2years, model = "fd")
summary(fd_model)
Oneway (individual) effect First-Difference Model

Call:
plm(formula = inv ~ value + capital, data = p.data_2years, model = "fd")

Balanced Panel: n = 10, T = 2, N = 20
Observations used in estimation: 10

Residuals:
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-51.5451 -17.3865  -6.3621   7.1823  96.6838 

Coefficients:
             Estimate Std. Error t-value Pr(>|t|)
(Intercept) 17.901125  18.516740  0.9668   0.3659
value        0.061786   0.034264  1.8032   0.1143
capital     -1.011766   1.065441 -0.9496   0.3739

Total Sum of Squares:    19847
Residual Sum of Squares: 13551
R-Squared:      0.31722
Adj. R-Squared: 0.12214
F-statistic: 1.62613 on 2 and 7 DF, p-value: 0.26301
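The output above includes an intercept, whereas the first-difference equation in Section 7.2 has none. As far as I am aware, plm keeps an intercept in `"fd"` models by default (which amounts to allowing a linear time trend in the levels equation), and it can be dropped by adding `- 1` to the formula. The sketch below fits the no-intercept version and compares it with a manual differencing in base R. Treat the `- 1` behaviour as an assumption to verify against the plm documentation for your installed version.

```r
library(plm)
data("Grunfeld", package = "plm")

# Two-period subset, sorted by firm and year
g2 <- Grunfeld[Grunfeld$year %in% c(1935, 1936), ]
g2 <- g2[order(g2$firm, g2$year), ]

# First-difference model without an intercept, matching the equation in 7.2
fd_noint <- plm(inv ~ value + capital - 1, data = g2,
                index = c("firm", "year"), model = "fd")

# Manual check: difference each variable by hand, then OLS without intercept
vars <- c("inv", "value", "capital")
d <- g2[g2$year == 1936, vars] - g2[g2$year == 1935, vars]
fd_manual <- lm(inv ~ 0 + value + capital, data = d)

rbind(plm = coef(fd_noint), manual = coef(fd_manual))
```

With only 10 differenced observations, either version is estimated very imprecisely, so this subset is best read as a demonstration of the mechanics rather than as an empirical result.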