3 The Multiple Regression Model

The simple regression model is powerful but often insufficient. Economic relationships are rarely driven by a single factor. The multiple regression model allows us to quantify the relationship between a dependent variable and several independent variables simultaneously, which is crucial for tackling omitted variable bias.

3.1 The Trivariate Model & Interpretation

The population multiple regression model with two explanatory variables (the trivariate model) is written as:

\[Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i\]

Taking expectations on both sides gives the conditional mean function:

\[E(Y_i | X_{1i}, X_{2i}) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}\]

The key advantage of multiple regression is the ceteris paribus (all else equal) interpretation of its coefficients.

\(\beta_1\) is the marginal effect of \(X_1\) on \(Y\). It is the effect of a small change in the \(X_1\) on the dependent variable, while holding \(X_2\) constant.
\(\beta_2\) is the marginal effect of \(X_2\) on \(Y\), while holding \(X_1\) constant.

This holding-constant effect is what helps isolate the direct effect of one variable, controlling for the influence of others.

3.2 The OLS Estimators and Normal Equations

The sample regression function (SRF) is:

\[\hat{Y}_i = \hat{\alpha} + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}\] \[Y_i = \hat{\alpha} + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{u}_i\]

As in the bivariate case, the OLS procedure consists of choosing the unknown parameters \((\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2)\) such that the residual sum of squares (SSR) is minimized:

\[\min_{\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2} \sum_{i=1}^n \hat{u}_i^2 = \min_{\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2} \sum_{i=1}^n (Y_i - \hat{\alpha} - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i})^2\]

Differentiating and setting the derivatives to zero yields the system of normal equations:

\[\begin{aligned} \sum Y_i &= n\hat{\alpha} + \hat{\beta}_1 \sum X_{1i} + \hat{\beta}_2 \sum X_{2i} \\ \sum X_{1i}Y_i &= \hat{\alpha} \sum X_{1i} + \hat{\beta}_1 \sum X_{1i}^2 + \hat{\beta}_2 \sum X_{1i}X_{2i} \\ \sum X_{2i}Y_i &= \hat{\alpha} \sum X_{2i} + \hat{\beta}_1 \sum X_{1i}X_{2i} + \hat{\beta}_2 \sum X_{2i}^2 \end{aligned}\]

Solving this system of three equations provides the formulas for the OLS estimators \(\hat{\alpha}\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\). While the formulas are more complex than in the simple regression case, the intuition is similar.

Hence, the OLS estimators for the slope coefficients in a multiple regression model \(Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i\) are given by:

\[ \begin{align*} \hat{\beta}_1&=\frac{(\sum_{i=1}^n x_{1i}y_i)(\sum_{i=1}^n x_{2i}^2)-(\sum_{i=1}^n x_{2i}y_i)(\sum_{i=1}^n x_{1i}x_{2i})}{(\sum_{i=1}^n x_{1i}^2)(\sum_{i=1}^n x_{2i}^2)-(\sum_{i=1}^n x_{1i}x_{2i})^2}\\ \hat{\beta}_2&=\frac{(\sum_{i=1}^n x_{2i}y_i)(\sum_{i=1}^n x_{1i}^2)-(\sum_{i=1}^n x_{1i}y_i)(\sum_{i=1}^n x_{1i}x_{2i})}{(\sum_{i=1}^n x_{1i}^2)(\sum_{i=1}^n x_{2i}^2)-(\sum_{i=1}^n x_{1i}x_{2i})^2} \end{align*} \] And the estimator for the intercept is:

\[ \hat\alpha = \bar Y - \hat\beta_1 \bar X_1 - \hat\beta_2 \bar X_2 \]

3.2.1 Extending the `mtcars` Model

We can estimate the coefficients easily using R. Let’s first expand our car mileage model. Perhaps a car’s horsepower (hp) also affects its fuel efficiency (mpg), in addition to its weight (wt). We can estimate this trivariate model.

# Estimate a multiple regression model
# mpg = alpha + beta1 * wt + beta2 * hp + u

multi_model <- lm(mpg ~ wt + hp, data = mtcars)

# Print the summary results
summary(multi_model)


Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Interpretation of the Coefficients:

wt coefficient (-3.878): Holding horsepower constant, a one-ton increase in weight is associated with a decrease in fuel efficiency of approximately 3.88 miles per gallon, on average.
hp coefficient (-0.032): Holding weight constant, a one-horsepower increase is associated with a decrease in fuel efficiency of approximately 0.032 miles per gallon, on average.
Intercept (37.227): The predicted miles per gallon for a car that weighs 0 tons and has 0 horsepower. (Note: This is often not a meaningful value and serves just as a baseline for the regression line.)

Notice how the coefficient on wt changed from -5.34 in the simple regression to -3.88 in this multiple regression. This suggests that horsepower was an omitted variable that was correlated with both weight and mileage, and failing to control for it biased our initial estimate of the effect of weight.

3.3 Standard Errors of the OLS Estimates

The formula for the variance of a slope estimator, say \(\hat{\beta}_1\), in the trivariate model is:

\[Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum (X_{1i} - \bar{X}_1)^2 (1 - r^2_{12})}\]

\(\sigma^2\) is the variance of the error term \(u_i\).
\(\sum (X_{1i} - \bar{X}_1)^2\) is the total variation in \(X_1\).
\(r_{12}\) is the sample correlation between \(X_1\) and \(X_2\).

Since we never observe the true error variance (\(\sigma^2\)), we bootstrap and use its estimate:

\[\hat{\sigma}^2 = \frac{1}{n - k} \sum_{i=1}^n \hat{u}_i^2\]

where \(k\) is the number of estimated coefficients (including the constant, so \(k=3\) for our model). The standard error of the coefficient is then the square root of its estimated variance: \(SE(\hat{\beta}_1) = \sqrt{\widehat{Var}(\hat{\beta}_1)}\).

Hence, the standard errors for \(\beta 's\) in the model \(Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i\) are given by:

\[ \begin{aligned} SE(\hat{\beta}_1) &= \sqrt{\widehat{Var}(\hat{\beta}_1)} \quad \text{or} \quad \hat{\sigma} \frac{1}{\sqrt{\sum x_{1i}^2 (1 - r_{12}^2)}} \\ SE(\hat{\beta}_2) &= \sqrt{\widehat{Var}(\hat{\beta}_2)} \text{or} \quad \hat{\sigma} \frac{1}{\sqrt{\sum x_{2i}^2 (1 - r_{12}^2)}} \end{aligned} \]

The standard error for the intercept is:

\[ SE(\hat{\alpha}) = \sqrt{\widehat{Var}(\hat{\alpha})} = \hat{\sigma} \sqrt{ \frac{1}{n} + \frac{\overline{X}_1^2 \sum x_{2i}^2 + \overline{X}_2^2 \sum x_{1i}^2 - 2\overline{X}_1 \overline{X}_2 \sum x_{1i} x_{2i}}{\sum x_{1i}^2 \sum x_{2i}^2 - (\sum x_{1i} x_{2i})^2} } \]

where:

\(\hat{\sigma} = \sqrt{\frac{\sum \hat{u}_i^2}{n - k}}\) is the standard error of the regression (\(SER\)),
\(r_{12}\) is the sample correlation between \(X_1\) and \(X_2\),
\(x_{1i} = X_{1i} - \bar{X}_1\) and \(x_{2i} = X_{2i} - \bar{X}_2\) are deviations

The summary() function in R displays these standard errors, as seen in the output above.

3.4 R-squared and The Adjusted R-squared

As in the bivariate case, the coefficient of determination, \(R^2\), is the fraction of the sample variation in \(Y\) explained by the model:

\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]

A key feature of multiple regression is that adding any new variable (even an irrelevant one) will never decrease the \(R^2\). Because of this, we often prefer the adjusted R-squared, which penalizes for adding irrelevant variables.

\[\bar{R}^2 = 1 - \frac{SSR/(n-k)}{SST/(n-1)} = 1 - \left( \frac{n-1}{n-k} \right) \frac{SSR}{SST}\]

\(n\) is the sample size.
\(k\) is the number of coefficients, including the constant.
\(\bar{R}^2\) can decrease if a new variable adds little explanatory power, providing a better gauge of whether a variable should be included.
Always compare \(\bar{R}^2\), not \(R^2\), when models have a different number of predictors.

In our mtcars output, we see both R-squared: 0.8268 and Adjusted R-squared: 0.8148.

3.4.1 Interpreting \(R^2\) and \(\bar{R}^2\) in Practice

It is critical to remember that: 1. An increase in the \(R^2\) or \(\bar{R}^2\) does not necessarily mean that an added variable is statistically significant. 2. A high \(R^2\) or \(\bar{R}^2\) does not mean that the regressors are a true cause of the dependent variable (causation vs. correlation). 3. A high \(R^2\) or \(\bar{R}^2\) does not mean that there is no omitted variable bias. 4. A high \(R^2\) or \(\bar{R}^2\) does not necessarily mean that you have the most appropriate set of regressors, nor does a low \(R^2\) or \(\bar{R}^2\) mean that you have an inappropriate model.

3.5 Hypothesis Testing

3.5.1 Testing Individual Coefficients

The procedure for testing hypotheses about a single coefficient is identical to the simple regression case. For example, to test if \(X_1\) has a significant effect on \(Y\) after controlling for \(X_2\):

\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)

The test statistic is the t-ratio: \(t = \displaystyle\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\).

As a rule of thumb, we reject the null hypothesis if the absolute value of the t-ratio is greater than “2”.

3.5.2 Testing the Overall Significance: The F-Test

The test for the overall significance of the regression is a joint test that all slope coefficients are equal to zero:

\[ H_0: \beta_1 = \beta_2 = 0 \quad \text{vs.} \quad H_1: \text{at least one } \beta_j \neq 0 \]

This test is conducted using the F-statistic. The F-statistic can be computed in several equivalent forms:

1. Using sums of squares and cross-products: \[ F = \frac{ \left( \hat{\beta}_1 \sum y_i x_{1i} + \hat{\beta}_2 \sum y_i x_{2i} \right) / m }{ \sum \hat{u}_i^2 / (n - k) } \] where

\(m\) is the number of restrictions
\(n-k\) is the degree of freedom of the regression

The above is the ratio of explained and unexplained sums of squares divided by their respective degrees of freedom which corresponds to ANOVA:

\[ F = \frac{SSE / (k - 1)}{SSR / (n - k)} \] where \(k\) is the total number of estimated parameters (for a model with two regressors, \(k=3\): \(\alpha\), \(\beta_1\), and \(\beta_2\)).

Note this is the same if we use the coefficient of determination (\(R^2\)):

\[ F = \frac{R^2 / (k - 1)}{(1 - R^2) / (n - k)} \]

Hence, the new test in multiple regression is the test for overall significance of the regression, which is a joint test that all slope coefficients are simultaneously equal to zero.

\(H_0: \beta_1 = \beta_2 = 0\)
\(H_1: \text{At least one } \beta_j \neq 0\)

This test is given by the F-statistic, which is reported in the standard regression output. The F-statistic is constructed using the sums of squares from the ANOVA (Analysis of Variance) framework:

\[SST = SSE + SSR\]

The F-statistic is the ratio of the explained to unexplained variance, adjusted for degrees of freedom:

\[F = \frac{SSE / (k-1)}{SSR / (n-k)} = \frac{MSR}{MSE}\]

where \(k-1\) is the number of slope coefficients. A large F-statistic provides evidence against the null hypothesis that the model provides no better fit than a model with only an intercept.

3.5.3 Joint Tests and Restricted Models

The F-test can be generalized to test any set of linear restrictions. For example, we can test if a subset of coefficients is equal to zero.

Assume the unrestricted model is:

\[Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i\]

We could test the joint hypothesis:

\(H_0: \beta_2 = 0, \beta_3 = 0\)
\(H_1: H_0 \text{ is not true}\)

This involves estimating a restricted model where the restrictions under the null are imposed: \[Y_i = \alpha + \beta_1 X_{1i} + u_i\]

The general F-statistic formula for testing \(m\) restrictions is:

\[F = \frac{(SSR_r - SSR_{ur}) / m}{SSR_{ur} / (n - k)}\]

where:

\(SSR_r\) is the sum of squared residuals from the restricted model.
\(SSR_{ur}\) is the sum of squared residuals from the unrestricted model.
\(m\) is the number of restrictions.
\(n - k\) is the degrees of freedom in the unrestricted model.

Here is another example where We could test the joint hypothesis:

\(H_0: \beta_1 = 0, \beta_2 + \beta_3 = 1\)
\(H_1: H_0 \text{ is not true}\)

Here, the unrestricted model is the same as before:

\[Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i\]

To find the restricted model, we impose the null hypothesis:

Substitute \(\beta_1 = 0\) and \(\beta_3 = 1 - \beta_2\) into the model:

\[Y_i = \alpha + \beta_2 X_{2i} + (1 - \beta_2) X_{3i} + u_i\]

Rearrange the equation:

\[Y_i = \alpha + X_{3i} + \beta_2 (X_{2i} - X_{3i}) + u_i\]

Define a new dependent variable \(Y_i^* = Y_i - X_{3i}\):

\[Y_i^* = \alpha + \beta_2 (X_{2i} - X_{3i}) + u_i\]

We then perform the F-test as usual and a low p-value would lead to a rejection of the null hypothesis \(H_0\). Note here that \(m\) the number of restrictions is 2 and not 3!

3.6 The Problem of Omitted Variable Bias

Often, we are interested in understanding the relation between two variables (\(Y\) and \(X\)). But running a simple regression of \(Y\) on \(X\) might not be enough. The assumption \(Cov(X, u)=0\) might be violated if the error term \(u\) contains a variable that is correlated with \(X\), thereby introducing a bias. Multiple regression is a primary tool to help resolve this endogeneity problem.

3.6.1 Direction of the Bias

Omitting an important variable introduces a bias to the OLS estimator. The direction of this bias can be formalized.

Suppose the true model is: \[Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i\] But we incorrectly estimate the misspecified model: \[Y_i = \alpha + \beta_1 X_{1i} + v_i \quad \text{where} \quad v_i = \beta_2 X_{2i} + u_i\]

The bias in the simple regression estimator \(\tilde{\beta}_1\) is: \[Bias(\tilde{\beta}_1) = E[\tilde{\beta}_1] - \beta_1 = \beta_2 \cdot \tilde{\delta}_1\]

where \(\tilde{\delta}_1\) is the slope coefficient from an auxiliary regression of the omitted variable (\(X_2\)) on the included variable (\(X_1\)): \[X_{2i} = \delta_0 + \delta_1 X_{1i} + e_i\]

The sign of the bias depends on the signs of \(\beta_2\) (the effect of the omitted variable on \(Y\)) and \(\tilde{\delta}_1\) (the correlation between \(X_2\) and \(X_1\)).
The size of the bias depends on the magnitude of \(\beta_2\) and \(\tilde{\delta}_1\).

3.7 The Cost of Including an Irrelevant Variable

Conversely, including a variable that does not belong in the true model (i.e., whose true coefficient is zero) has different consequences.

Suppose the true model is: \[Y_i = \alpha + \beta_1 X_{1i} + u_i\] But we incorrectly estimate: \[Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + v_i\]

Does it introduce bias? No. The OLS estimators for all coefficients, including \(\hat{\beta}_1\), remain unbiased.
What is the cost? Increased variance. The estimates become less precise. The standard error of \(\hat{\beta}_1\) will generally be larger than it would be in the correctly specified simple regression model, leading to less powerful hypothesis tests and wider confidence intervals.

The trade-off is clear: omitting a relevant variable causes bias, while including an irrelevant variable reduces efficiency. When in doubt, it is often less harmful to include a potentially irrelevant variable than to omit a potentially relevant one.