
Chapter 16 — Evaluating Forecasts

In the previous chapter, we learned how forecasts are generated from time series models.

But an important question remains: how do we know whether a forecast is any good?

Forecasting is not only about producing predictions; it is also about evaluating forecast quality.

A model that fits historical data very well may still forecast poorly out of sample.

This chapter introduces the most commonly used measures of forecast accuracy and forecast quality.

We will use a real example based on Thai inflation forecasts to illustrate the ideas.


Learning Objectives

By the end of this chapter, you should be able to:

- compute forecast errors and interpret their sign
- calculate and interpret bias, MSE, RMSE, MAE, and MAPE
- use Theil's $U_1$ and $U_2$ statistics to assess forecast accuracy
- decompose forecast error into bias, variance, and covariance components
- explain why different accuracy measures can rank the same forecasts differently


16.1 Forecast Errors

Forecast evaluation begins with the forecast error.

Suppose $x_t$ is the actual value of the series at time $t$ and $\hat{x}_t$ is the forecast of that value.

The forecast error is:

$$e_t = x_t - \hat{x}_t$$

If $e_t > 0$, the forecast was too low (underprediction); if $e_t < 0$, it was too high (overprediction).

Good evaluation distinguishes systematic errors, which persist in one direction, from unsystematic errors, which average out over time.
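To make the definition concrete, here is a minimal sketch (with made-up numbers rather than the chapter's data) that computes forecast errors with NumPy:

```python
import numpy as np

# Hypothetical actual values and forecasts
actual = np.array([2.0, 1.8, 2.2])
forecast = np.array([1.9, 2.0, 2.2])

# e_t = x_t - x_hat_t: positive -> underprediction, negative -> overprediction
errors = actual - forecast
print(errors)  # approximately [ 0.1 -0.2  0. ]
```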


16.2 A Thai Inflation Forecast Example

Suppose we compare two competing forecasts of Thailand’s year-on-year CPI inflation.

The table below shows actual year-on-year inflation for each month of 2014, together with the two competing forecasts.

| Date | Actual Inflation | Forecast 1 | Forecast 2 |
|----------|------|------|------|
| Jan 2014 | 1.93 | 1.84 | 1.67 |
| Feb 2014 | 1.96 | 1.92 | 1.93 |
| Mar 2014 | 2.11 | 1.96 | 1.96 |
| Apr 2014 | 2.45 | 2.31 | 2.11 |
| May 2014 | 2.62 | 3.61 | 2.45 |
| Jun 2014 | 2.35 | 2.45 | 2.62 |
| Jul 2014 | 2.16 | 2.01 | 2.35 |
| Aug 2014 | 2.09 | 1.99 | 2.16 |
| Sep 2014 | 1.75 | 1.81 | 2.09 |
| Oct 2014 | 1.48 | 1.65 | 1.75 |
| Nov 2014 | 1.26 | 1.21 | 1.48 |
| Dec 2014 | 0.60 | 1.13 | 1.26 |

We now ask: which forecast is more accurate?

For the data and computations in Excel, see LINK.


16.3 Mean Forecast Error (Bias)

The simplest measure is the average forecast error.

$$\text{Bias} = \frac{1}{T} \sum_{t=1}^{T} e_t$$

or equivalently:

$$\text{Bias} = \frac{1}{T} \sum_{t=1}^{T} (x_t - \hat{x}_t)$$

For the Thai inflation example:

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| Bias | -0.094 | -0.089 |

Both forecasts slightly overpredict inflation on average: the actual values fall a little below the forecasts, so the mean of $x_t - \hat{x}_t$ is negative.
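As a check, here is a minimal sketch computing the bias for both forecasts from the rounded values in the table; tiny discrepancies with published figures can arise from rounding of the underlying data.

```python
import numpy as np

# Thai CPI inflation (year-on-year, %), Jan-Dec 2014, from the table above
actual = np.array([1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09,
                   1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.84, 1.92, 1.96, 2.31, 3.61, 2.45, 2.01, 1.99,
               1.81, 1.65, 1.21, 1.13])
f2 = np.array([1.67, 1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16,
               2.09, 1.75, 1.48, 1.26])

# Bias = mean of (actual - forecast)
print(round((actual - f1).mean(), 3))  # -0.094
print(round((actual - f2).mean(), 3))  # -0.089
```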


16.4 Mean Squared Error (MSE)

One problem with simple bias is that positive and negative errors can cancel out.

To avoid this, we square the errors.

$$\text{MSE} = \frac{1}{T} \sum_{t=1}^{T} e_t^2$$

Large mistakes therefore matter disproportionately.

Thai Inflation Example

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| MSE | 0.116 | 0.084 |

Forecast 2 has the smaller MSE.

So Forecast 2 performs better according to this criterion.


16.5 Root Mean Squared Error (RMSE)

MSE is useful mathematically, but it is measured in the squared units of the data.

To restore the original units, we take the square root.

$$\text{RMSE} = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} e_t^2 }$$

For inflation forecasting, RMSE is measured in percentage points of inflation.

Thai Inflation Example

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| RMSE | 0.341 | 0.290 |

Again, Forecast 2 performs better.


16.6 Why RMSE Is Popular

RMSE is one of the most widely used forecast evaluation measures.

Why? Because it is measured in the same units as the data, it penalizes large errors heavily, and it corresponds directly to the squared-error criterion that many estimation methods minimize.

Examples of its use include central bank forecast evaluations, forecasting competitions, and academic model comparisons.


16.7 Mean Absolute Error (MAE)

Instead of squaring errors, we can use absolute values.

$$\text{MAE} = \frac{1}{T} \sum_{t=1}^{T} |e_t|$$

Unlike RMSE, MAE does not square the errors: every error contributes in proportion to its absolute size, so large errors receive no extra weight.

Thai Inflation Example

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| MAE | 0.214 | 0.247 |

Now Forecast 1 performs better.

This illustrates an important idea: different accuracy measures can rank the same forecasts differently.


16.8 RMSE vs MAE

RMSE and MAE emphasize different aspects of forecast performance.

| Measure | Sensitive to Large Errors? |
|------|------|
| MAE | Less sensitive |
| RMSE | More sensitive |

RMSE is more sensitive to large errors because squaring amplifies extreme values.

Intuition

Suppose one model makes many moderate errors, while another makes mostly tiny errors but an occasional very large one.

RMSE may strongly penalize the second model, while MAE may prefer it, as the sketch below shows.
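Here is a minimal numerical sketch (with made-up error series) showing how the two measures can disagree:

```python
import numpy as np

# Model A: steady moderate errors; Model B: tiny errors plus one large miss
errors_a = np.array([0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3])
errors_b = np.array([0.05, -0.05, 0.05, -0.05, 0.05, -0.05, 0.05, 1.2])

for name, e in [("A", errors_a), ("B", errors_b)]:
    rmse = np.sqrt(np.mean(e**2))
    mae = np.mean(np.abs(e))
    print(f"{name}: RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

Model B wins on MAE (0.194 versus 0.300) but loses on RMSE (0.427 versus 0.300): its single large error dominates once the errors are squared.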


16.9 Percentage Errors

Sometimes variables have different scales.

In such cases, percentage-based measures may be useful.

The percentage forecast error is:

$$\text{Percentage Error}_t = 100 \times \frac{e_t}{x_t}$$

16.10 Mean Absolute Percentage Error (MAPE)

A common percentage-based measure is:

$$\text{MAPE} = \frac{100}{T} \sum_{t=1}^{T} \left| \frac{e_t}{x_t} \right|$$

Thai Inflation Example

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| MAPE | 14.96 | 19.11 |

Forecast 1 performs better according to MAPE.


16.11 Problems with MAPE

MAPE has some important weaknesses.

Problem 1: Division by Small Numbers

If actual values are close to zero, the ratio

$$\frac{e_t}{x_t}$$

can become extremely large.
can become extremely large.

Problem 2: Undefined for Zero Values

If $x_t = 0$ for any observation, then MAPE is undefined.

Problem 3: Asymmetry

MAPE can penalize overpredictions and underpredictions differently.

This is common in:


16.12 Relationship Between MSE, RMSE, and Bias

Recall:

$$\text{MSE} = \frac{1}{T} \sum e_t^2$$

MSE can be decomposed into:

$$\text{MSE} = \text{Variance of Errors} + (\text{Bias})^2$$

or approximately:

$$\text{MSE} = SE^2 + \text{Bias}^2$$

where $SE$ is the standard deviation of the forecast errors, so $SE^2$ is their variance. The identity is exact when the variance is computed with denominator $T$.
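A short sketch verifying the identity on Forecast 1's errors (using the rounded table values); note that np.var uses denominator $T$ by default:

```python
import numpy as np

actual = np.array([1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09,
                   1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.84, 1.92, 1.96, 2.31, 3.61, 2.45, 2.01, 1.99,
               1.81, 1.65, 1.21, 1.13])

errors = actual - f1
mse = np.mean(errors**2)
decomposed = np.var(errors) + np.mean(errors)**2  # variance + bias^2

print(round(mse, 4), round(decomposed, 4))  # the two values are identical
```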


16.13 Theil’s $U_1$ Statistic

Theil’s $U_1$ is a normalized measure of forecast accuracy.

One common version is:

$$U_1 = \frac{ \sqrt{ \frac{1}{T} \sum (\hat{x}_t - x_t)^2 } }{ \sqrt{ \frac{1}{T} \sum \hat{x}_t^2 } + \sqrt{ \frac{1}{T} \sum x_t^2 } }$$

Properties

- $U_1$ always lies between 0 and 1.
- $U_1 = 0$ corresponds to a perfect forecast.
- Smaller values indicate better accuracy.

Thai Inflation Example

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| Theil’s $U_1$ | 0.084 | 0.071 |

Forecast 2 performs slightly better.


16.14 Theil’s $U_2$ Statistic

Theil’s $U_2$ compares a forecast against a benchmark forecast.

Usually the benchmark is a naive forecast:

$$\hat{x}_t = x_{t-1}$$

For example, the naive forecast of February inflation is simply the January value. Notice that Forecast 2 in our table is exactly this naive forecast: each month's Forecast 2 equals the previous month's actual inflation.

Definition

A common form is:

$$U_2 = \sqrt{ \frac{ \frac{1}{T} \sum \left( \frac{\hat{x}_t - x_t}{x_{t-1}} \right)^2 }{ \frac{1}{T} \sum \left( \frac{x_t - x_{t-1}}{x_{t-1}} \right)^2 } }$$

Interpretation

| Value | Interpretation |
|------|------|
| $U_2 < 1$ | model beats the naive forecast |
| $U_2 = 1$ | same as the naive forecast |
| $U_2 > 1$ | worse than the naive forecast |

Thai Inflation Example

| Measure | Forecast 1 | Forecast 2 |
|------|------|------|
| Theil’s $U_2$ | 0.94 | 1.00 |

Forecast 1 slightly outperforms the naive benchmark.

Forecast 2 scores exactly 1.00, as it must, since it coincides with the naive benchmark.
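Here is a sketch computing both Theil statistics for Forecast 1 from the rounded table values; small differences from the published figures (for example, $U_2 \approx 0.96$ rather than 0.94) reflect rounding of the underlying data.

```python
import numpy as np

actual = np.array([1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09,
                   1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.84, 1.92, 1.96, 2.31, 3.61, 2.45, 2.01, 1.99,
               1.81, 1.65, 1.21, 1.13])

def theil_u1(x, f):
    rmse = np.sqrt(np.mean((f - x) ** 2))
    return rmse / (np.sqrt(np.mean(f**2)) + np.sqrt(np.mean(x**2)))

def theil_u2(x, f):
    # Sums start at the second observation, where a lagged actual exists
    num = np.mean(((f[1:] - x[1:]) / x[:-1]) ** 2)
    den = np.mean(((x[1:] - x[:-1]) / x[:-1]) ** 2)
    return np.sqrt(num / den)

print(round(theil_u1(actual, f1), 3))  # approx 0.084
print(round(theil_u2(actual, f1), 3))  # approx 0.96
```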


16.15 Forecast Error Decomposition

Theil also proposed decomposing forecast errors into components.

A common decomposition separates MSE into three parts: a bias proportion, a variance proportion, and a covariance proportion.

Bias Proportion

Measures systematic differences between forecast mean and actual mean.

Variance Proportion

Measures differences in variability.

Covariance Proportion

Measures unsystematic error.

Thai Inflation Example

Forecast 1

| Component | Value |
|------|------|
| Bias proportion | 0.076 |
| Variance proportion | 0.040 |
| Covariance proportion | 0.885 |

Most errors are unsystematic.

This is generally a good sign.

Forecast 2

| Component | Value |
|------|------|
| Bias proportion | 0.094 |
| Variance proportion | 0.300 |
| Covariance proportion | 0.624 |

Forecast 2 has a larger variance component.

This suggests that the model may not adequately capture changes in volatility or variability.


16.16 Understanding the Sources of Forecast Error

An alternative way to understand forecast errors is to decompose MSE into three components:

$$\text{MSE} = \text{BIAS}^2 + (s_x - s_{\hat x})^2 + 2(1-r)s_x s_{\hat x}$$

where:

Bias Component

The first term:

$$\text{BIAS}^2$$

captures systematic overprediction or underprediction.

Variability Component

The second term:

$$(s_x - s_{\hat x})^2$$

measures whether the forecast is too volatile or too smooth relative to the actual data.

A forecast may correctly predict the average level while still failing to match the fluctuations of the series.

Covariance Component

The final term:

$$2(1-r)s_x s_{\hat x}$$

captures failure to track movements in the actual series.

If the correlation between forecasts and actual values is high, this component becomes small.
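A sketch computing the three components for Forecast 2 from the rounded table values; dividing each term by MSE gives the proportions of Section 16.15 (small differences from the published proportions reflect rounding of the underlying data).

```python
import numpy as np

actual = np.array([1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09,
                   1.75, 1.48, 1.26, 0.60])
f2 = np.array([1.67, 1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16,
               2.09, 1.75, 1.48, 1.26])

mse = np.mean((actual - f2) ** 2)
bias = np.mean(actual - f2)
s_x, s_f = np.std(actual), np.std(f2)  # population std (denominator T)
r = np.corrcoef(actual, f2)[0, 1]

terms = np.array([bias**2, (s_x - s_f) ** 2, 2 * (1 - r) * s_x * s_f])

print(round(terms.sum(), 4), round(mse, 4))  # the terms sum exactly to MSE
print((terms / mse).round(3))  # bias, variance, covariance proportions
```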


16.17 Choosing Between Forecasts

Which forecast is “best”?

There is no universal answer.

Different measures emphasize different aspects of performance.

| Criterion | Forecast 1 Better? | Forecast 2 Better? |
|------|:------:|:------:|
| MSE |  | ✓ |
| RMSE |  | ✓ |
| MAE | ✓ |  |
| MAPE | ✓ |  |
| Theil’s $U_1$ |  | ✓ |
| Theil’s $U_2$ | ✓ |  |

16.18 Why Forecast Evaluation Matters

Forecast evaluation is central in:

- monetary policy, where central banks rely on inflation forecasts
- financial markets and risk management
- business planning and budgeting

A model that performs well historically may fail during:

- structural breaks
- economic crises
- periods of unusual volatility


16.19 Python Example: Comparing Forecast Accuracy

The code below evaluates the two forecasts over March–December 2014 (10 observations), so the numbers differ somewhat from the full-year figures reported earlier in the chapter.

```python
import numpy as np
import pandas as pd

# Thai CPI inflation (year-on-year, %), Mar-Dec 2014, and two forecasts
actual = np.array([2.11, 2.45, 2.62, 2.35, 2.16, 2.09, 1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.96, 2.31, 3.61, 2.45, 2.01, 1.99, 1.81, 1.65, 1.21, 1.13])
f2 = np.array([1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09, 1.75, 1.48, 1.26])

# Forecast errors: e_t = x_t - x_hat_t
e1 = actual - f1
e2 = actual - f2

def forecast_stats(errors, actual):
    """Return the standard forecast accuracy measures as a pandas Series."""
    mse = np.mean(errors**2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    mape = np.mean(np.abs(errors / actual)) * 100
    bias = np.mean(errors)
    return pd.Series({
        "Bias": bias,
        "MSE": mse,
        "RMSE": rmse,
        "MAE": mae,
        "MAPE": mape
    })

results = pd.DataFrame({
    "Forecast 1": forecast_stats(e1, actual),
    "Forecast 2": forecast_stats(e2, actual)
})

print(results.round(3))
```

Output:

```
      Forecast 1  Forecast 2
Bias      -0.126      -0.136
MSE        0.138       0.095
RMSE       0.372       0.309
MAE        0.244       0.268
MAPE      17.381      21.624
```

16.20 Gretl Example: Forecast Evaluation

GRETL provides several forecast evaluation tools.

After estimating a model and generating forecasts, the evaluation statistics are available from the model window, for example:

Model window → Analysis → Forecast evaluation

depending on the GRETL version.

GRETL may report the mean error (bias), RMSE, MAE, MPE, MAPE, Theil’s $U$, and the bias, regression, and disturbance proportions of the error decomposition.


[GRETL Screenshot Placeholder: Forecast evaluation statistics]

Comparing Competing Models

You can compare forecasts from two or more competing model specifications, provided they are evaluated over the same forecast sample.


16.21 Common Mistakes

- Judging a model only by in-sample fit rather than out-of-sample forecast performance.
- Relying on a single accuracy measure when different measures can rank forecasts differently.
- Using MAPE when actual values are zero or close to zero.
- Comparing models over different forecast samples.
- Never checking whether a model actually beats a naive benchmark.


16.22 Looking Ahead

So far, we have focused mainly on forecasting individual time series.

In the next part of the book, we move to relationships between time series.

We will study how movements in one series can be related to, and used to forecast, movements in another.

Key Takeaways

- The forecast error is $e_t = x_t - \hat{x}_t$; all evaluation measures build on it.
- Bias measures systematic over- or underprediction, but positive and negative errors can cancel.
- MSE and RMSE weight large errors heavily; MAE weights all errors proportionally.
- MAPE is scale-free but unreliable when actual values are near zero.
- Theil’s $U_1$ normalizes accuracy to lie between 0 and 1; Theil’s $U_2$ benchmarks against a naive forecast.
- MSE can be decomposed into bias, variance, and covariance components.
- Different measures can rank the same forecasts differently, so no single criterion is definitive.

Concept Check

Basic

  1. What is a forecast error?

  2. What does it mean if a forecast error is positive?

  3. What is bias in forecasting?


Intuition

  1. Why can a model with zero bias still perform poorly?

  2. Why do we square forecast errors in MSE?

  3. Why might large forecast errors be particularly important in practice?


Measures

  1. What is the difference between:

    • MSE

    • RMSE

    • MAE

  2. Why is RMSE often preferred to MSE?

  3. What does MAE measure differently from RMSE?


Percentage Errors

  1. What is MAPE?

  2. Why might MAPE be misleading when values are close to zero?


Theil Measures

  1. What is the intuition behind Theil’s $U_1$?

  2. What does Theil’s $U_2$ compare?


Challenge

  1. Why might different evaluation measures rank forecasts differently?


Interpretation & Practice

  1. A model has:

  2. A model has:

  3. Two models produce similar RMSE, but one has lower MAE.

    • What might this indicate?

  4. A model performs worse than a naive forecast.

    • What does this imply?

  5. Forecast errors are consistently positive.

    • What does this indicate about the model?

Thai Inflation Example

  1. In the example:

  2. Forecast 2 has a higher variance proportion.

    • What does this suggest?


Challenge

  1. A model has excellent in-sample fit but poor out-of-sample performance.

    • What might be happening?


Numerical Practice

Forecast Errors

  1. Suppose:


Bias

  1. Suppose forecast errors are:

$$2, -1, 3, -2$$

Compute the bias.

MSE and RMSE

  1. Using the same errors, compute the MSE and RMSE.


MAE

  1. Compute MAE for the same data.


Interpretation

  1. Compare RMSE and MAE.


MAPE

  1. Suppose:

  2. Suppose $x_t = 0.1$.


Model Comparison

  1. Suppose two models have:

| Model | RMSE | MAE |
|------|------|------|
| A | 2.5 | 2.0 |
| B | 2.0 | 2.2 |

Which model would you prefer, and why?

Theil’s $U_2$

  1. Suppose:


Challenge

  1. Suppose a model minimizes MSE but performs poorly in practice. What might explain this?

  2. You are forecasting inflation.

Which would you choose if: