
Chapter 17 — Spurious Regression

In earlier chapters, we learned that many time series are nonstationary. We also saw that random walks can drift over time even when they are generated only by random shocks.

We now study one of the most important warnings in applied time series regression: a regression between two unrelated series can look highly significant simply because both series wander or trend over time.

This problem is especially common when working with nonstationary data, such as random walks or trending variables.


Learning Objectives

By the end of this chapter, you should be able to:

  • explain what spurious regression is and why it arises with nonstationary data;
  • recognize its typical symptoms, such as a high $R^2$ together with strongly autocorrelated residuals;
  • use residual plots, correlograms, and unit root tests to diagnose the problem;
  • re-estimate a relationship in differences and interpret the result;
  • explain when a regression in levels can still be meaningful (cointegration).


17.1 The Basic Problem

Suppose we regress one time series on another:

y_t = \alpha + \beta x_t + u_t

In ordinary regression analysis, a statistically significant $\beta$ might suggest that $x_t$ is related to $y_t$.

But with time series data, especially nonstationary data, this conclusion can be misleading.

This is the problem of spurious regression.

The model may report:

  • a high $R^2$,
  • large t-statistics, and
  • very small p-values.

But these are driven by shared trending behavior — not by a true economic relationship.


17.2 Why Nonstationarity Creates Trouble

Consider two independent random walks:

x_t = x_{t-1} + e_t

and

y_t = y_{t-1} + v_t

where $e_t$ and $v_t$ are independent white noise processes.

By construction, there is no true relationship between $x_t$ and $y_t$.

Yet both series may drift over time.

If two unrelated random walks drift in similar directions over a sample period, a regression may interpret that common movement as evidence of a relationship.


17.3 Symptoms of Spurious Regression

A spurious regression often produces:

  • a high $R^2$,
  • apparently significant t-statistics, and
  • a very low Durbin-Watson statistic, reflecting strong positive autocorrelation in the residuals.

This is one reason why time series regression requires additional diagnostic checks.


17.4 A Simple Simulation

Let us simulate two independent random walks.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(124)

n = 200

u = np.random.standard_normal(n)   # white-noise shocks for x
v = np.random.standard_normal(n)   # white-noise shocks for y

x = np.cumsum(u)   # random walk built by accumulating the shocks u
y = np.cumsum(v)   # random walk built by accumulating the shocks v

plt.figure(figsize=(10, 5))
plt.plot(x, label="x")
plt.plot(y, label="y")
plt.title("Two Independent Random Walks")
plt.xlabel("Time")
plt.ylabel("Value")
plt.legend()

plt.savefig("figs/ch17/rw.png", dpi=300, bbox_inches="tight")
plt.close()   # replace with plt.show()
Figure: Two independent random walks

17.5 Regressing One Random Walk on Another

Now regress $y_t$ on $x_t$:

y_t = \alpha + \beta x_t + u_t
import statsmodels.api as sm

X = sm.add_constant(x)
model = sm.OLS(y, X)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.2312
Date:                Wed, 29 Apr 2026   Prob (F-statistic):              0.631
Time:                        14:10:42   Log-Likelihood:                -607.38
No. Observations:                 200   AIC:                             1219.
Df Residuals:                     198   BIC:                             1225.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.4880      0.469     11.692      0.000       4.562       6.414
x1             0.0249      0.052      0.481      0.631      -0.077       0.127
==============================================================================
Omnibus:                       15.351   Durbin-Watson:                   0.038
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               16.030
Skew:                          -0.654   Prob(JB):                     0.000330
Kurtosis:                       2.538   Cond. No.                         12.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In this particular draw the slope coefficient happens to be insignificant, but note the Durbin-Watson statistic of 0.038: the residuals are extremely persistent, a warning that the levels regression cannot be trusted. Rerunning the simulation with other seeds will often produce a large $R^2$ and an apparently significant coefficient, even though the two variables were generated independently.

This is the essence of spurious regression.
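
How often does this happen? The following Monte Carlo sketch (an addition to the chapter's example; the exact rejection rate depends on the seed) repeats the levels regression on freshly generated, independent random walks and records how often the slope appears significant at the 5% level:

import numpy as np
import statsmodels.api as sm

np.random.seed(124)

n_reps = 1000   # number of simulated regressions
n = 200         # sample size, matching the example above
significant = 0

for _ in range(n_reps):
    # two independent random walks
    x_sim = np.cumsum(np.random.standard_normal(n))
    y_sim = np.cumsum(np.random.standard_normal(n))

    res = sm.OLS(y_sim, sm.add_constant(x_sim)).fit()

    # conventional 5% test on the slope coefficient
    if res.pvalues[1] < 0.05:
        significant += 1

print(f"Fraction of 'significant' slopes: {significant / n_reps:.2f}")

With independent stationary series the reported fraction should be close to 0.05; with independent random walks it is typically far larger.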


17.6 Why This Is Problematic

Both $x_t$ and $y_t$ are nonstationary random walks.

Standard regression inference relies on assumptions that are violated in this setting.

This can lead to:

  • t-statistics and p-values that do not follow their usual distributions,
  • an $R^2$ that stays large even when no relationship exists, and
  • coefficient estimates that do not converge to zero as the sample grows.


17.7 Residual Diagnostics

A key diagnostic is to examine the residuals.

If a regression relationship is meaningful, the residuals should be stationary.

If the residuals remain nonstationary, the regression has not captured a stable relationship.


Residual Plot

uhat = results.resid

plt.figure(figsize=(10, 4))
plt.plot(uhat, linewidth=1.5)
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.title("Residuals from Spurious Regression")
plt.xlabel("Time")
plt.ylabel("Residual")

plt.savefig("figs/ch17/res.png", dpi=300, bbox_inches="tight")
plt.close()   # replace with plt.show()
Figure: Residuals from the spurious regression

Residual ACF

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(uhat, lags=40)
plt.title("ACF of Residuals")

plt.savefig("figs/ch17/res_acf.png", dpi=300, bbox_inches="tight")
plt.close()   # replace with plt.show()
Figure: ACF of the residuals

17.8 Unit Root Testing

To formally check nonstationarity, we can perform unit root tests.

For the original series, we expect:

  • a failure to reject the unit root null for both $x_t$ and $y_t$, since each was constructed as a random walk.

For a spurious regression, residuals are often nonstationary as well.

from statsmodels.tsa.stattools import adfuller

def adf_summary(series, name):
    result = adfuller(series)
    print(f"ADF test for {name}")
    print(f"ADF statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    print("Critical values:")
    for key, value in result[4].items():
        print(f"  {key}: {value:.4f}")
    print()

adf_summary(x, "x")
adf_summary(y, "y")
adf_summary(uhat, "residuals")
ADF test for x
ADF statistic: -0.7378
p-value: 0.8367
Critical values:
  1%: -3.4638
  5%: -2.8763
  10%: -2.5746

ADF test for y
ADF statistic: -0.6210
p-value: 0.8662
Critical values:
  1%: -3.4636
  5%: -2.8762
  10%: -2.5746

ADF test for residuals
ADF statistic: -0.5837
p-value: 0.8746
Critical values:
  1%: -3.4636
  5%: -2.8762
  10%: -2.5746

17.9 Fixing the Problem: Differencing

A common solution is to difference the data.

Instead of estimating:

y_t = \alpha + \beta x_t + u_t

we estimate:

\Delta y_t = \alpha + \beta \Delta x_t + u_t

where:

\Delta x_t = x_t - x_{t-1}

and:

\Delta y_t = y_t - y_{t-1}

Regression in Differences

d_x = np.diff(x)
d_y = np.diff(y)

D_X = sm.add_constant(d_x)

diff_model = sm.OLS(d_y, D_X)
diff_results = diff_model.fit()

print(diff_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9755
Date:                Wed, 29 Apr 2026   Prob (F-statistic):              0.325
Time:                        14:14:36   Log-Likelihood:                -279.21
No. Observations:                 199   AIC:                             562.4
Df Residuals:                     197   BIC:                             569.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0405      0.070     -0.576      0.565      -0.179       0.098
x1             0.0719      0.073      0.988      0.325      -0.072       0.215
==============================================================================
Omnibus:                        0.339   Durbin-Watson:                   1.967
Prob(Omnibus):                  0.844   Jarque-Bera (JB):                0.181
Skew:                          -0.067   Prob(JB):                        0.914
Kurtosis:                       3.063   Cond. No.                         1.08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The slope coefficient is now small and clearly insignificant, which is what we should expect, because the two random walks were generated from independent shocks. Note also that the Durbin-Watson statistic is close to 2, so the residuals no longer show strong autocorrelation.


17.10 Interpretation

Spurious regression arises because nonstationary series can move together over time even when there is no true relationship.

Differencing often solves the problem by removing the stochastic trend.

However, differencing is not always the final answer.

Sometimes nonstationary variables really do move together because of a long-run equilibrium relationship.

This leads to the idea of cointegration.


17.11 Connection to Cointegration

There is one important exception to the warning about regressions in levels: if two nonstationary series share a common stochastic trend, a regression in levels can capture a genuine long-run relationship, and its residuals will be stationary.

In other words: nonstationary variables that move together because of a long-run equilibrium relationship are said to be cointegrated.

We return to this idea in Chapter 20.
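
As a brief preview of the tools developed in Chapter 20, statsmodels includes an Engle-Granger style cointegration test. A minimal sketch, assuming the x and y arrays from the earlier simulation are still in memory:

from statsmodels.tsa.stattools import coint

# Engle-Granger test: null hypothesis = no cointegration between y and x
t_stat, p_value, crit_values = coint(y, x)

print(f"Cointegration test statistic: {t_stat:.4f}")
print(f"p-value:                      {p_value:.4f}")
print("Critical values (1%, 5%, 10%):", crit_values)

For the two independent random walks simulated above we would expect a large p-value, consistent with no cointegration; when variables do share a long-run equilibrium, the test rejects and a regression in levels can be meaningful.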


17.12 GRETL Example

We now reproduce the simulation and regression in GRETL.

Step 1: Create Data in GRETL

  1. Set up for data entry:
    File → New data set

  2. Select:
    Time series: T = 200

  3. Choose frequency:
    Other

This creates an index variable.

Then generate two white-noise variables:

Add → Random variable → Normal distribution

Generate:

  • u (the white-noise shocks for x)
  • v (the white-noise shocks for y)

Then create cumulative sums:

Add → Define new variable

Use:

x = cum(u)
y = cum(v)

Command

nulldata 200
set seed 124

series u = normal()
series v = normal()

series x = cum(u)
series y = cum(v)

Step 2: Plot the Series

Graph → Time series plot

Select:

  • x
  • y

Command

gnuplot x y --time-series --with-lines

Figure 1: Two independent random walks

Step 3: Run the Spurious Regression

Model → Ordinary Least Squares

Command

ols y const x

Example output:

Model 1: OLS, using observations 1-200
Dependent variable: y

             coefficient   std. error   t-ratio    p-value 
  ---------------------------------------------------------
  const       0.578591     0.295706      1.957    0.0518    *
  x           0.660102     0.0333565    19.79     8.40e-049 ***

Mean dependent var  −4.061555   S.D. dependent var   4.385945
Sum squared resid    1285.510   S.E. of regression   2.548033
R-squared            0.664188   Adjusted R-squared   0.662492
F(1, 198)            391.6158   P-value(F)           8.40e-49
Log-likelihood      −469.8470   Akaike criterion     943.6941
Schwarz criterion    950.2907   Hannan-Quinn         946.3636

Step 4: Save and Plot Residuals

In the regression window:

Save → Residuals

Command

series uhat = $uhat
gnuplot uhat --time-series --with-lines

Figure 2: Residuals from the spurious regression

The residuals still show persistent behavior.

Step 5: Check Residual Autocorrelation

Variable → Correlogram

Select uhat.

If you do not see this option, go to:

Data → Dataset structure...

and select:

Time series → Other

You can also right click on uhat and select Correlogram.

Command

corrgm uhat

Figure 3: Correlogram of residuals

Step 6: ADF Tests

Variable → Unit root tests → Augmented Dickey-Fuller

Test:

  • x
  • y
  • uhat

Command

adf 0 x
adf 0 y
adf 0 uhat

The commands above impose zero augmentation lags, whereas the example output below was produced with automatic lag selection (testing down from 14 lags by the AIC), so it is often better to use the menu and let GRETL determine an appropriate lag length for the ADF test.

Example output for x:

Augmented Dickey-Fuller test for x
testing down from 14 lags, criterion AIC
sample size 199
unit-root null hypothesis: a = 1

  test with constant 
  including 0 lags of (1-L)x
  model: (1-L)y = b0 + (a-1)*y(-1) + e
  estimated value of (a - 1): -0.0224933
  test statistic: tau_c(1) = -1.57309
  asymptotic p-value 0.4964

Example output for y:

Augmented Dickey-Fuller test for y
testing down from 14 lags, criterion AIC
sample size 199
unit-root null hypothesis: a = 1

  test with constant 
  including 0 lags of (1-L)y
  estimated value of (a - 1): -0.0198401
  test statistic: tau_c(1) = -1.22352
  asymptotic p-value 0.6666

Example output for uhat:

Augmented Dickey-Fuller test for uhat
testing down from 14 lags, criterion AIC
sample size 195
unit-root null hypothesis: a = 1

  test with constant 
  including 4 lags of (1-L)uhat
  estimated value of (a - 1): -0.117113
  test statistic: tau_c(1) = -2.93493
  asymptotic p-value 0.04143

Note that this p-value is based on the standard Dickey-Fuller distribution. Because uhat is a residual from an estimated regression, the appropriate (Engle-Granger) critical values are more stringent, so this marginal rejection should be interpreted with caution.

Step 7: Difference the Data

Select x and y, right click, and choose:

Add differences

Command

series d_x = diff(x)
series d_y = diff(y)

Step 8: Re-estimate the Regression in Differences

Model → Ordinary Least Squares

Command

ols d_y const d_x

Example output:

Model 2: OLS, using observations 2-200 (T = 199)
Dependent variable: d_y

             coefficient   std. error   t-ratio   p-value
  -------------------------------------------------------
  const      −0.0721970    0.0706927    −1.021    0.3084 
  d_x        −0.0574247    0.0647031    −0.8875   0.3759 

Mean dependent var  −0.068429   S.D. dependent var   0.994910
Sum squared resid    195.2089   S.E. of regression   0.995444
R-squared            0.003982   Adjusted R-squared  -0.001073
F(1, 197)            0.787676   P-value(F)           0.375886
Log-likelihood      −280.4549   Akaike criterion     564.9099
Schwarz criterion    571.4965   Hannan-Quinn         567.5757
rho                  0.000807   Durbin-Watson        1.978732

17.13 Practical Checklist

Before trusting a time series regression in levels:

  • Plot the series and ask whether they trend or wander.
  • Test each series for a unit root.
  • Inspect the residuals: plot them, look at the correlogram, and check the Durbin-Watson statistic.
  • Treat a high $R^2$ combined with a very low Durbin-Watson statistic as a red flag.
  • If the residuals are nonstationary, do not interpret the levels regression; difference the data or test for cointegration.


17.14 Common Mistakes

  • Taking a high $R^2$ or large t-statistics at face value when both variables trend.
  • Skipping residual diagnostics after estimating a levels regression.
  • Differencing automatically, even when the variables are cointegrated and a levels relationship is meaningful.
  • Forgetting that standard critical values do not strictly apply to unit root tests on regression residuals.


17.15 Looking Ahead

Spurious regression teaches us a central lesson: statistical significance in a levels regression means little until we have checked that the variables, and especially the residuals, are stationary.

In the next chapter, we introduce dynamic regression models, including distributed lag models and ARDL models.

These models allow us to study how variables affect each other over time.

Key Takeaways

  • Regressions between nonstationary series can produce large $R^2$ values and apparently significant coefficients even when the series are unrelated.
  • Nonstationary, highly persistent residuals (and a Durbin-Watson statistic near zero) are the key warning sign.
  • Unit root tests on the series and on the residuals help diagnose the problem.
  • Differencing usually removes the spurious relationship; cointegration is the important exception.

Concept Check

Basic

  1. What is spurious regression?

  2. Why can two unrelated time series appear to be related?

  3. What role does nonstationarity play in spurious regression?


Intuition

  1. Why do random walks tend to drift over time?

  2. How can two independent random walks appear to move together?

  3. Why is a high $R^2$ not reliable evidence of a relationship in time series data?


Intermediate

  1. What assumptions of classical regression are violated with nonstationary data?

  2. Why are t-statistics and p-values unreliable in spurious regressions?

  3. What is the key diagnostic principle involving residuals?


Diagnostics

  1. What does it mean if regression residuals are nonstationary?

  2. Why is residual autocorrelation a warning sign?


Fixing the Problem

  1. Why does differencing often remove spurious relationships?


Challenge

  1. Can a regression in levels ever be valid with nonstationary variables?


Interpretation & Practice

  1. A regression produces:

  2. Two variables trend upward over time.

  3. Residuals from a regression appear highly persistent.

    • What does this imply?

  4. After differencing, the regression coefficient becomes insignificant.

    • What does this suggest about the original relationship?

  5. ADF test fails to reject a unit root for the residuals.

    • What does this imply?

  6. Suppose inflation and GDP both trend upward.


Challenge

  1. A regression in levels produces stationary residuals.

    • What does this suggest?

    • What concept does this relate to?


Numerical Practice

Random Walk Construction

  1. Suppose:

x_t = x_{t-1} + e_t

with shocks:

e_t = 2, -1, 3 \quad (t = 1, 2, 3)

and $x_0 = 0$. Compute $x_1$, $x_2$, and $x_3$.


Differencing

  1. Using the values above, compute the first differences $\Delta x_1$, $\Delta x_2$, and $\Delta x_3$, and verify that they recover the shocks. (A short numerical check appears after the Interpretation question below.)


Interpretation

  1. Why is $\Delta x_t$ stationary but $x_t$ is not?
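
To verify the random walk values and first differences from the exercises above, here is a minimal numerical sketch (the shock values and $x_0 = 0$ are taken from the exercise):

import numpy as np

e = np.array([2.0, -1.0, 3.0])   # shocks e_1, e_2, e_3 from the exercise

x = np.cumsum(e)                 # levels x_1, x_2, x_3, starting from x_0 = 0
dx = np.diff(x, prepend=0.0)     # first differences, using x_0 = 0

print("x:      ", x)    # the random walk values
print("delta x:", dx)   # the differences recover the original shocks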


Regression Output

  1. Suppose a regression produces:



ADF Interpretation

  1. Suppose you estimate a regression between two time series and obtain the following ADF test results:

  Series       ADF p-value
  x_t          0.82
  y_t          0.76
  residuals    0.80

  1. Now consider an alternative case:

  Series       ADF p-value
  x_t          0.88
  y_t          0.91
  residuals    0.02



  1. Why is it possible for $x_t$ and $y_t$ to be nonstationary, but the residuals to be stationary?


Model Comparison

  1. Suppose:



Challenge

  1. Suppose you regress two random walks and obtain:



  1. Suppose residuals are stationary.


Appendix 17A — Why Nonstationarity Leads to Spurious Regression

This appendix provides an intuitive but slightly more formal explanation of why regressions involving nonstationary time series can produce misleading results.


A.1 Setup

Consider two independent random walks:

x_t = x_{t-1} + e_t, \quad e_t \sim \text{WN}(0, \sigma_e^2)

y_t = y_{t-1} + v_t, \quad v_t \sim \text{WN}(0, \sigma_v^2)

Assume:

  • $e_t$ and $v_t$ are independent of each other at all leads and lags, and
  • $x_0 = y_0 = 0$.


A.2 Accumulation of Shocks

A random walk can be written as:

x_t = \sum_{s=1}^{t} e_s

so its variance is:

\text{Var}(x_t) = t \, \sigma_e^2

Because the variance grows without bound as $t$ increases, $x_t$ cannot be stationary.
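
A quick simulation, included here only as a sketch, confirms that the sample variance of $x_t$ across many replications grows roughly in proportion to $t$ (here $\sigma_e^2 = 1$):

import numpy as np

np.random.seed(124)

n_reps = 5000   # number of simulated random walks
n = 200         # length of each walk

# each row is one random walk x_1, ..., x_n built from N(0, 1) shocks
walks = np.cumsum(np.random.standard_normal((n_reps, n)), axis=1)

# sample variance across replications at a few dates t
for t in (10, 50, 100, 200):
    var_t = walks[:, t - 1].var()
    print(f"t = {t:3d}   sample Var(x_t) = {var_t:7.1f}   (theory: {t})")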


A.3 Trend-Like Behavior

Because shocks accumulate:

  • each series wanders far from its starting value, and
  • over any given sample it can display long upward or downward swings.

Even though these trends are random, they can look systematic in finite samples.


A.4 The Regression Problem

Now consider the regression:

y_t = \alpha + \beta x_t + u_t

The OLS estimator is:

\hat{\beta} = \frac{\sum_t x_t y_t}{\sum_t x_t^2}

This expression is written in simplified mean-zero form. The same intuition applies when an intercept is included.


A.5 Why the Estimator Misbehaves

Even though $x_t$ and $y_t$ are independent:

  • the sums $\sum_t x_t y_t$ and $\sum_t x_t^2$ are dominated by the accumulated stochastic trends,
  • $\hat{\beta}$ does not settle down at zero as the sample grows; it remains a random quantity, and
  • the usual standard errors greatly understate the uncertainty, so t-statistics tend to be large.


A.6 Failure of Standard Inference

Standard regression theory assumes:

  • stationary regressors and errors, with stable means and variances, and
  • residuals that are not strongly autocorrelated.

But with nonstationary data:

  • variances grow with the sample size,
  • the residuals are highly persistent, and
  • t- and F-statistics no longer follow their usual distributions, so conventional critical values are misleading.
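
To illustrate the point, the following sketch (an addition, using the same simulation setup as the chapter) tracks the typical size of the spurious t-statistic as the sample length grows; under valid inference it would settle near the standard normal range, but here it keeps growing:

import numpy as np
import statsmodels.api as sm

np.random.seed(124)

def median_abs_t(n, n_reps=300):
    """Median |t| on the slope from regressing one random walk on another."""
    t_stats = []
    for _ in range(n_reps):
        x_sim = np.cumsum(np.random.standard_normal(n))
        y_sim = np.cumsum(np.random.standard_normal(n))
        res = sm.OLS(y_sim, sm.add_constant(x_sim)).fit()
        t_stats.append(abs(res.tvalues[1]))
    return np.median(t_stats)

for n in (50, 200, 800):
    print(f"T = {n:4d}   median |t| = {median_abs_t(n):.2f}")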


A.7 Residual Behavior

If the regression were meaningful, the residuals should be stationary.

However:

u_t = y_t - \hat{\alpha} - \hat{\beta} x_t

often inherits nonstationarity from $y_t$ and $x_t$.


A.8 Big Picture

  • Random walks accumulate shocks, so unrelated series can wander together over any finite sample.
  • OLS mistakes this shared wandering for a relationship, and standard inference offers no protection.
  • The residuals inherit the nonstationarity, which is why residual diagnostics are the key check.


A.9 How to Fix the Problem

There are two main approaches:

  1. Difference the data and regress $\Delta y_t$ on $\Delta x_t$, which removes the stochastic trends.
  2. Test for cointegration; if the series share a common stochastic trend, a regression in levels can represent a genuine long-run relationship (Chapter 20).