Chapter 16 — Evaluating Forecasts
In the previous chapter, we learned how forecasts are generated from time series models.
But an important question remains: how good are those forecasts?
Forecasting is not only about producing predictions.
It is also about evaluating forecast quality.
A model that fits historical data very well may still forecast poorly out of sample.
This chapter introduces the most commonly used measures of forecast accuracy and forecast quality.
We will use a real example based on Thai inflation forecasts to illustrate the ideas.
Learning Objectives¶
By the end of this chapter, you should be able to:
compute forecast errors
distinguish bias from variability
interpret MSE, RMSE, MAE, and MAPE
understand the tradeoff between different forecast evaluation measures
interpret Theil’s $U_1$ and $U_2$ statistics
understand forecast error decomposition
compare competing forecasting models
16.1 Forecast Errors¶
Forecast evaluation begins with the forecast error.
Suppose:
actual value: $y_t$
forecast value: $\hat{y}_t$
The forecast error is:

$$e_t = y_t - \hat{y}_t$$

If:
$e_t > 0$: the model underpredicted
$e_t < 0$: the model overpredicted
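For example, if actual inflation is $y_t = 2.0$ and the forecast was $\hat{y}_t = 1.8$, then $e_t = 2.0 - 1.8 = +0.2$: the model underpredicted inflation by 0.2 percentage points.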
Good evaluation distinguishes:
systematic mistakes (bias)
random fluctuations (variance)
failure to track movements (covariance)
16.2 A Thai Inflation Forecast Example¶
Suppose we compare two competing forecasts of Thailand’s year-on-year CPI inflation.
The table below shows:
actual inflation
Forecast 1, generated by an AR(1) model
Forecast 2, generated by a random walk model
The forecast errors $e_t = y_t - \hat{y}_t$ follow directly from these columns.
| Date | Actual Inflation | Forecast 1 | Forecast 2 |
|---|---|---|---|
| Jan 2014 | 1.93 | 1.84 | 1.67 |
| Feb 2014 | 1.96 | 1.92 | 1.93 |
| Mar 2014 | 2.11 | 1.96 | 1.96 |
| Apr 2014 | 2.45 | 2.31 | 2.11 |
| May 2014 | 2.62 | 3.61 | 2.45 |
| Jun 2014 | 2.35 | 2.45 | 2.62 |
| Jul 2014 | 2.16 | 2.01 | 2.35 |
| Aug 2014 | 2.09 | 1.99 | 2.16 |
| Sep 2014 | 1.75 | 1.81 | 2.09 |
| Oct 2014 | 1.48 | 1.65 | 1.75 |
| Nov 2014 | 1.26 | 1.21 | 1.48 |
| Dec 2014 | 0.60 | 1.13 | 1.26 |
We now ask: which forecast is more accurate?
For data and computations in Excel see LINK.
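The table can also be set up directly in Python for the computations that follow; a minimal sketch using pandas, with the numbers copied from the table above:

```python
import pandas as pd

# Monthly Thai CPI inflation data from the table above (percent, year-on-year)
dates = pd.period_range("2014-01", "2014-12", freq="M")
df = pd.DataFrame({
    "Actual":     [1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09, 1.75, 1.48, 1.26, 0.60],
    "Forecast 1": [1.84, 1.92, 1.96, 2.31, 3.61, 2.45, 2.01, 1.99, 1.81, 1.65, 1.21, 1.13],
    "Forecast 2": [1.67, 1.93, 1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09, 1.75, 1.48, 1.26],
}, index=dates)

# Forecast errors e_t = y_t - yhat_t for each competing forecast
df["e1"] = df["Actual"] - df["Forecast 1"]
df["e2"] = df["Actual"] - df["Forecast 2"]
print(df.round(2))
```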
16.3 Mean Forecast Error (Bias)¶
The simplest measure is the average forecast error:

$$\text{Bias} = \frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)$$

or equivalently:

$$\text{Bias} = \bar{e} = \frac{1}{T}\sum_{t=1}^{T} e_t$$

positive bias: forecasts tend to be too low
negative bias: forecasts tend to be too high
For the Thai inflation example:
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| Bias | -0.094 | -0.089 |

Both forecasts slightly overpredict inflation on average: the bias is negative, so the forecasts tend to sit above the actual values.
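As a check, the Forecast 1 errors implied by the table sum to $-1.13$ over the 12 months, so

$$\text{Bias}_1 = \frac{-1.13}{12} \approx -0.094.$$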
16.4 Mean Squared Error (MSE)¶
One problem with simple bias is that positive and negative errors can cancel out.
To avoid this, we square the errors:

$$\text{MSE} = \frac{1}{T}\sum_{t=1}^{T} e_t^2$$

Large mistakes therefore matter disproportionately.
Thai Inflation Example¶
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| MSE | 0.116 | 0.084 |
Forecast 2 has the smaller MSE.
So Forecast 2 performs better according to this criterion.
16.5 Root Mean Squared Error (RMSE)¶
MSE is useful mathematically, but its units are squared.
To restore the original units, we take the square root:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_t^2}$$
For inflation forecasting, RMSE is measured in percentage points of inflation.
Thai Inflation Example¶
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| RMSE | 0.341 | 0.290 |
Again, Forecast 2 performs better.
16.6 Why RMSE Is Popular¶
RMSE is one of the most widely used forecast evaluation measures.
Why?
Because:
it penalizes large errors
it is easy to interpret
it preserves original units
Examples include:
inflation forecasting by central banks
electricity demand forecasting
financial risk forecasting
16.7 Mean Absolute Error (MAE)¶
Instead of squaring errors, we can use absolute values:

$$\text{MAE} = \frac{1}{T}\sum_{t=1}^{T} \left|e_t\right|$$
Unlike RMSE:
MAE penalizes errors linearly
extreme errors matter less
Thai Inflation Example¶
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| MAE | 0.214 | 0.247 |
Now Forecast 1 performs better.
This illustrates an important idea: different accuracy measures can rank the same forecasts differently.
16.8 RMSE vs MAE¶
RMSE and MAE emphasize different aspects of forecast performance.
| Measure | Sensitive to Large Errors? |
|---|---|
| MAE | Less sensitive |
| RMSE | More sensitive |
RMSE is more sensitive to large errors because squaring amplifies extreme values.
Intuition¶
Suppose one model makes:
many moderate errors
but no extremely large mistakes
Another model makes:
mostly small errors
but occasionally a very large mistake
RMSE may strongly penalize the second model.
MAE may prefer it.
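A small numerical sketch makes the contrast concrete; the two error vectors below are hypothetical, chosen only to illustrate the point:

```python
import numpy as np

# Model A: many moderate errors, none extreme
e_a = np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
# Model B: mostly small errors, one very large mistake
e_b = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -2.0])

for name, e in [("A", e_a), ("B", e_b)]:
    rmse = np.sqrt(np.mean(e**2))
    mae = np.mean(np.abs(e))
    print(f"Model {name}: RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

Here RMSE prefers Model A (0.500 vs 0.713) while MAE prefers Model B (0.338 vs 0.500): the single large mistake dominates the squared errors but not the absolute ones.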
16.9 Percentage Errors¶
Sometimes variables have different scales.
In such cases, percentage-based measures may be useful.
The percentage forecast error is:

$$p_t = 100 \times \frac{y_t - \hat{y}_t}{y_t}$$
16.10 Mean Absolute Percentage Error (MAPE)¶
A common percentage-based measure is:

$$\text{MAPE} = \frac{100}{T}\sum_{t=1}^{T}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$
Thai Inflation Example¶
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| MAPE (%) | 14.96 | 19.11 |
Forecast 1 performs better according to MAPE.
16.11 Problems with MAPE¶
MAPE has some important weaknesses.
Problem 1: Division by Small Numbers¶
If actual values are close to zero, the percentage error $\left|e_t / y_t\right|$ can become extremely large.
Problem 2: Undefined for Zero Values¶
If $y_t = 0$ for any observation, then MAPE is undefined.
Problem 3: Asymmetry¶
MAPE can penalize overpredictions and underpredictions differently.
These difficulties are especially relevant for variables that are often close to zero, such as:
inflation
growth rates
financial returns
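A brief sketch of Problem 1, with hypothetical values in which the actual series dips toward zero:

```python
import numpy as np

actual   = np.array([2.00, 1.00, 0.05])   # last observation is close to zero
forecast = np.array([1.80, 1.10, 0.25])   # absolute errors: 0.2, 0.1, 0.2

pct_errors = 100 * np.abs((actual - forecast) / actual)
print(pct_errors)         # [ 10.  10. 400.]
print(pct_errors.mean())  # MAPE = 140.0, dominated by the near-zero observation
```

All three absolute errors are of similar size, yet the near-zero observation contributes a 400% error and drags MAPE up to 140%.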
16.12 Relationship Between MSE, RMSE, and Bias¶
Recall:

$$\text{MSE} = \frac{1}{T}\sum_{t=1}^{T} e_t^2$$

MSE can be decomposed into:

$$\text{MSE} = \sigma_e^2 + \text{Bias}^2$$

or approximately:

$$\text{RMSE} \approx \sigma_e \quad \text{when the bias is small}$$

where:
$\sigma_e$ = standard deviation of the forecast errors
Bias = mean forecast error
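The identity is easy to verify numerically; a minimal sketch using the Forecast 1 errors from the table (with the population variance, `ddof=0`, the identity holds exactly):

```python
import numpy as np

# Forecast 1 errors (actual minus forecast) from the table above
e1 = np.array([0.09, 0.04, 0.15, 0.14, -0.99, -0.10,
               0.15, 0.10, -0.06, -0.17, 0.05, -0.53])

mse = np.mean(e1**2)
bias = np.mean(e1)
var = np.var(e1)               # population variance (ddof=0)

print(round(mse, 3), round(var + bias**2, 3))  # both print 0.116
```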
16.13 Theil’s $U_1$ Statistic¶
Theil’s $U_1$ is a normalized measure of forecast accuracy.
One common version is:

$$U_1 = \frac{\sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2}}{\sqrt{\frac{1}{T}\sum_{t=1}^{T} y_t^2} + \sqrt{\frac{1}{T}\sum_{t=1}^{T} \hat{y}_t^2}}$$

Properties¶
$U_1$ lies between 0 and 1
$U_1 = 0$ indicates a perfect forecast
values closer to 0 indicate more accurate forecasts
Thai Inflation Example¶
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| Theil’s $U_1$ | 0.084 | 0.071 |
Forecast 2 performs slightly better.
16.14 Theil’s $U_2$ Statistic¶
Theil’s $U_2$ compares a forecast against a benchmark forecast.
Usually the benchmark is a naive “no change” forecast:

$$\hat{y}_{t+1}^{\text{naive}} = y_t$$
For example:
tomorrow equals today
next month equals this month
Definition¶
A common form measures the forecast errors relative to the errors of the naive forecast:

$$U_2 = \sqrt{\frac{\sum_{t=1}^{T-1}\left(\frac{\hat{y}_{t+1} - y_{t+1}}{y_t}\right)^2}{\sum_{t=1}^{T-1}\left(\frac{y_{t+1} - y_t}{y_t}\right)^2}}$$

Intuitively, $U_2$ is the RMSE of the model expressed relative to the RMSE of the naive forecast.
Interpretation¶
| Value | Interpretation |
|---|---|
| $U_2 < 1$ | model beats the naive forecast |
| $U_2 = 1$ | same as the naive forecast |
| $U_2 > 1$ | worse than the naive forecast |
Thai Inflation Example¶
| Measure | Forecast 1 | Forecast 2 |
|---|---|---|
| Theil’s $U_2$ | 0.94 | 1.00 |
Forecast 1 slightly outperforms the naive benchmark.
Forecast 2 performs roughly the same as the naive forecast.
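Both Theil statistics are easy to compute; a minimal sketch using the table data, where the naive forecast for each month is simply the previous month’s actual value. Because the tabulated data are rounded to two decimals, the computed $U_2$ (about 0.96) differs slightly from the 0.94 reported above.

```python
import numpy as np

actual = np.array([1.93, 1.96, 2.11, 2.45, 2.62, 2.35,
                   2.16, 2.09, 1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.84, 1.92, 1.96, 2.31, 3.61, 2.45,
               2.01, 1.99, 1.81, 1.65, 1.21, 1.13])

# Theil's U1: RMSE normalized by the scale of both series
rmse = np.sqrt(np.mean((actual - f1) ** 2))
u1 = rmse / (np.sqrt(np.mean(actual**2)) + np.sqrt(np.mean(f1**2)))

# Theil's U2: model errors relative to the naive "no change" forecast,
# both expressed as proportions of the previous actual value
num = np.sum(((f1[1:] - actual[1:]) / actual[:-1]) ** 2)
den = np.sum(((actual[1:] - actual[:-1]) / actual[:-1]) ** 2)
u2 = np.sqrt(num / den)

print(f"U1 = {u1:.3f}, U2 = {u2:.2f}")
```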
16.15 Forecast Error Decomposition¶
Theil also proposed decomposing forecast errors into components.
A common decomposition separates MSE into three proportions that sum to one:
bias proportion
variance proportion
covariance proportion
Bias Proportion¶
Measures systematic differences between forecast mean and actual mean.
Variance Proportion¶
Measures differences in variability.
Covariance Proportion¶
Measures unsystematic error.
Thai Inflation Example¶
Forecast 1¶
| Component | Value |
|---|---|
| Bias proportion | 0.076 |
| Variance proportion | 0.040 |
| Covariance proportion | 0.885 |
Most errors are unsystematic.
This is generally a good sign.
Forecast 2¶
| Component | Value |
|---|---|
| Bias proportion | 0.094 |
| Variance proportion | 0.300 |
| Covariance proportion | 0.624 |
Forecast 2 has a larger variance component.
This suggests that the model may not adequately capture changes in volatility or variability.
16.16 Understanding the Sources of Forecast Error¶
An alternative way to understand forecast errors is to decompose MSE into three components:

$$\text{MSE} = \left(\bar{\hat{y}} - \bar{y}\right)^2 + \left(\sigma_{\hat{y}} - \sigma_y\right)^2 + 2\left(1 - \rho\right)\sigma_y \sigma_{\hat{y}}$$

where:
$\sigma_y$ is the standard deviation of the actual series
$\sigma_{\hat{y}}$ is the standard deviation of the forecast series
$\rho$ is the correlation between actual and forecast values
and $\bar{y}$, $\bar{\hat{y}}$ are the corresponding sample means.
Bias Component¶
The first term, $\left(\bar{\hat{y}} - \bar{y}\right)^2$,
Variability Component¶
The second term, $\left(\sigma_{\hat{y}} - \sigma_y\right)^2$,
A forecast may correctly predict the average level while still failing to match the fluctuations of the series.
Covariance Component¶
The final term, $2\left(1 - \rho\right)\sigma_y \sigma_{\hat{y}}$,
If the correlation between forecasts and actual values is high, this component becomes small.
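A minimal sketch computing the three components and their proportions for Forecast 1; with population standard deviations (`ddof=0`) the components add up to the MSE exactly. The computed proportions come out close to those reported in Section 16.15, with small differences due to rounding in the tabulated data.

```python
import numpy as np

actual = np.array([1.93, 1.96, 2.11, 2.45, 2.62, 2.35,
                   2.16, 2.09, 1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.84, 1.92, 1.96, 2.31, 3.61, 2.45,
               2.01, 1.99, 1.81, 1.65, 1.21, 1.13])

mse = np.mean((f1 - actual) ** 2)
s_y, s_f = np.std(actual), np.std(f1)      # population std (ddof=0)
rho = np.corrcoef(actual, f1)[0, 1]        # correlation of actual and forecast

bias_comp = (f1.mean() - actual.mean()) ** 2
var_comp = (s_f - s_y) ** 2
cov_comp = 2 * (1 - rho) * s_y * s_f

print(np.isclose(bias_comp + var_comp + cov_comp, mse))   # True
for name, c in [("Bias", bias_comp), ("Variance", var_comp), ("Covariance", cov_comp)]:
    print(f"{name} proportion: {c / mse:.3f}")
```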
16.17 Choosing Between Forecasts¶
Which forecast is “best”?
There is no universal answer.
Different measures emphasize different aspects of performance.
| Criterion | Forecast 1 Better? | Forecast 2 Better? |
|---|---|---|
| MSE |  | ✓ |
| RMSE |  | ✓ |
| MAE | ✓ |  |
| MAPE | ✓ |  |
| Theil’s $U_1$ |  | ✓ |
| Theil’s $U_2$ | ✓ |  |
16.18 Why Forecast Evaluation Matters¶
Forecast evaluation is central in:
monetary policy
financial trading
inventory management
energy demand forecasting
macroeconomic planning
A model that performs well historically may fail during:
crises
structural breaks
regime changes
periods of unusual volatility
16.19 Python Example: Comparing Forecast Accuracy¶
```python
import numpy as np
import pandas as pd

# Last 10 months of the sample (March-December 2014)
actual = np.array([2.11, 2.45, 2.62, 2.35, 2.16, 2.09, 1.75, 1.48, 1.26, 0.60])
f1 = np.array([1.96, 2.31, 3.61, 2.45, 2.01, 1.99, 1.81, 1.65, 1.21, 1.13])
f2 = np.array([1.96, 2.11, 2.45, 2.62, 2.35, 2.16, 2.09, 1.75, 1.48, 1.26])

# Forecast errors: e_t = actual - forecast
e1 = actual - f1
e2 = actual - f2

def forecast_stats(errors, actual):
    """Compute the accuracy measures introduced in this chapter."""
    mse = np.mean(errors**2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    mape = np.mean(np.abs(errors / actual)) * 100
    bias = np.mean(errors)
    return pd.Series({
        "Bias": bias,
        "MSE": mse,
        "RMSE": rmse,
        "MAE": mae,
        "MAPE": mape
    })

results = pd.DataFrame({
    "Forecast 1": forecast_stats(e1, actual),
    "Forecast 2": forecast_stats(e2, actual)
})
print(results.round(3))
```

Output:

```
      Forecast 1  Forecast 2
Bias      -0.126      -0.136
MSE        0.138       0.095
RMSE       0.372       0.309
MAE        0.244       0.268
MAPE      17.381      21.624
```

16.20 Gretl Example: Forecast Evaluation¶
Gretl provides several forecast evaluation tools.
After generating forecasts, open:
Model window → Analysis → Forecast evaluation
(the exact menu path may vary depending on the Gretl version).
Gretl may report:
MSE
RMSE
MAE
MAPE
Theil statistics
[Gretl Screenshot Placeholder: Forecast evaluation statistics]

Comparing Competing Models¶
You can compare:
AR models
ARIMA models
VAR forecasts
naive forecasts
using the same forecast sample.
16.21 Common Mistakes¶
Judging a model by in-sample fit alone: good historical fit does not guarantee good out-of-sample forecasts.
Relying on a single accuracy measure: different measures can rank the same forecasts differently.
Using MAPE when the actual series is close to (or equal to) zero.
Ignoring the sign of the bias: consistently positive or negative errors signal a systematic problem.
Failing to compare against a naive benchmark such as the random walk.
16.22 Looking Ahead¶
So far, we have focused mainly on forecasting individual time series.
In the next part of the book, we move to relationships between time series.
We will study:
spurious regression
dynamic relationships
Granger causality
cointegration
error correction models
Key Takeaways¶
The forecast error is $e_t = y_t - \hat{y}_t$; its average (the bias) measures systematic over- or underprediction.
MSE and RMSE penalize large errors heavily; MAE penalizes all errors linearly.
MAPE is scale-free but unreliable when actual values are near zero.
Theil’s $U_1$ is a normalized accuracy measure; Theil’s $U_2$ compares a forecast against a naive benchmark.
MSE can be decomposed into bias, variance, and covariance components, which help diagnose the source of forecast errors.
No single measure is universally best: the appropriate criterion depends on the costs of different kinds of errors.
Concept Check¶
Basic¶
What is a forecast error?
What does it mean if a forecast error is positive?
What is bias in forecasting?
Intuition¶
Why can a model with zero bias still perform poorly?
Why do we square forecast errors in MSE?
Why might large forecast errors be particularly important in practice?
Measures¶
What is the difference between:
MSE
RMSE
MAE
Why is RMSE often preferred to MSE?
What does MAE measure differently from RMSE?
Percentage Errors¶
What is MAPE?
Why might MAPE be misleading when values are close to zero?
Theil Measures¶
What is the intuition behind Theil’s $U_1$?
What does Theil’s $U_2$ compare?
Challenge¶
Why might different evaluation measures rank forecasts differently?
Interpretation & Practice¶
A model has:
low bias
very high RMSE
What does this imply?
A model has:
low MAE
high RMSE
What does this suggest about the distribution of errors?
Two models produce similar RMSE, but one has lower MAE.
What might this indicate?
A model performs worse than a naive forecast.
What does this imply?
Forecast errors are consistently positive.
What does this indicate about the model?
Thai Inflation Example¶
In the example:
Forecast 2 performs better under RMSE
Forecast 1 performs better under MAE
Why might this happen?
Forecast 2 has higher variance proportion.
What does this suggest?
Challenge¶
A model has excellent in-sample fit but poor out-of-sample performance.
What might be happening?
Numerical Practice¶
Forecast Errors¶
Suppose:
actual:
forecast:
Compute the forecast error.
Bias¶
Suppose forecast errors are:
Compute the bias.
MSE and RMSE¶
Using the same errors:
Compute MSE
Compute RMSE
MAE¶
Compute MAE for the same data.
Interpretation¶
Compare RMSE and MAE.
Which is larger?
Why?
MAPE¶
Suppose:
actual:
forecast:
Compute percentage error.
Suppose $y_t$ is close to zero.
Why might MAPE become unreliable?
Model Comparison¶
Suppose:
| Model | RMSE | MAE |
|---|---|---|
| A | 2.5 | 2.0 |
| B | 2.0 | 2.2 |
Which is better under RMSE?
Which is better under MAE?
Theil’s U2¶
Suppose:
Model RMSE = 1.5
Naive RMSE = 2.0
Compute $U_2$ (as the ratio of the model RMSE to the naive RMSE) and interpret the result.
Challenge¶
Suppose a model minimizes MSE but performs poorly in practice.
Why might this happen?
You are forecasting inflation.
Model A minimizes RMSE
Model B minimizes MAE
Which would you choose if:
large mistakes are very costly?
average accuracy matters more?