Hat Matrix and Leverage

Hat Matrix

Purpose

The hat matrix provides a measure of leverage. It is useful for investigating whether one or more observations are outlying with regard to their X values, and therefore might be excessively influencing the regression results.

Definition

The hat matrix is also known as the projection matrix because it projects the vector of observations, y, onto the column space of X, producing the vector of predictions, $\hat{y}$, and thus putting the "hat" on y. The hat matrix H is defined in terms of the data matrix X:

$H = X (X^{T} X)^{-1} X^{T}$

and determines the fitted or predicted values since, with b denoting the vector of estimated coefficients,

$\hat{y} = H y = X b .$

The diagonal elements of H, h_ii, are called leverages and satisfy

$\begin{array}{l} 0 \leq h_{i i} \leq 1 \\ \sum_{i = 1}^{n} h_{i i} = p, \end{array}$

where p is the number of coefficients, and n is the number of observations (rows of X) in the regression model. HatMatrix is an n-by-n matrix in the Diagnostics table.
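The properties above can be checked numerically. The following is a minimal sketch on synthetic data, using base MATLAB only; the variable names (fitGap, traceGap, inRange) and the simulated design matrix are assumptions made for illustration and are not part of the toolbox.

```matlab
% Illustrative sketch with synthetic data (not from the toolbox).
rng(0);                          % fix the seed so the run is repeatable
n = 20;                          % number of observations
X = [ones(n,1) randn(n,2)];      % design matrix: intercept plus two predictors
y = X*[1; 2; -1] + 0.1*randn(n,1);
p = size(X,2);                   % number of coefficients

H = X*((X'*X)\X');               % hat matrix, X*(X'X)^(-1)*X'
b = X\y;                         % least-squares coefficient estimates
h = diag(H);                     % leverages

fitGap   = norm(H*y - X*b);      % should be near zero: H*y equals X*b
traceGap = abs(sum(h) - p);      % should be near zero: leverages sum to p
inRange  = all(h >= 0 & h <= 1); % each leverage lies in [0, 1]
```

Here X already contains the column of ones for the constant term, so p counts the intercept as one of the coefficients.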

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can:

Display the HatMatrix by indexing into the property using dot notation
```
mdl.Diagnostics.HatMatrix
```
When n is large, computing the full n-by-n HatMatrix might be computationally expensive. In those cases, you can obtain the diagonal values directly, using
```
mdl.Diagnostics.Leverage
```
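If you are working with a raw design matrix rather than a fitted model, the diagonal of H can also be obtained without forming the n-by-n matrix. The sketch below uses an economy-size QR decomposition on synthetic data. It is one possible approach, assuming X has full column rank, and is not necessarily how the Leverage property is computed internally.

```matlab
% Illustrative sketch with synthetic data (not from the toolbox).
rng(1);
n = 500;
X = [ones(n,1) randn(n,3)];      % design matrix including the intercept column
[Q,~] = qr(X,0);                 % economy-size QR: Q is n-by-p with orthonormal columns
h = sum(Q.^2,2);                 % row sums of squares give diag(Q*Q'), which equals diag(H)

% Check against the explicit hat matrix on this small problem
hExplicit = diag(X*((X'*X)\X'));
maxDiff = max(abs(h - hExplicit));   % should be near zero
```

Because Q has only p columns, this needs storage proportional to n*p rather than n^2.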

Leverage

Purpose

Leverage is a measure of the effect of a particular observation on the regression predictions due to the position of that observation in the space of the inputs. In general, the farther a point is from the center of the input space, the more leverage it has. Because the sum of the leverage values is p, the mean leverage value is p/n. An observation i can be considered an outlier in the input space if its leverage substantially exceeds this mean, for example, if it is larger than 2*p/n.

Definition

The leverage of observation i is the value of the ith diagonal term, h_ii, of the hat matrix, H, where

$H = X (X^{T} X)^{-1} X^{T} .$

The diagonal terms satisfy

$\begin{array}{l} 0 \leq h_{i i} \leq 1 \\ \sum_{i = 1}^{n} h_{i i} = p, \end{array}$

where p is the number of coefficients in the regression model, and n is the number of observations. The minimum value of h_ii is 1/n for a model with a constant term. If the fitted model goes through the origin, then the minimum leverage value is 0 for an observation at x = 0.
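As an illustration, consider the special case of simple linear regression with a constant term and a single predictor x (so p = 2). In that case the leverage has the well-known closed form

$h_{i i} = \frac{1}{n} + \frac{(x_{i} - \bar{x})^{2}}{\sum_{j = 1}^{n} (x_{j} - \bar{x})^{2}} ,$

where $\bar{x}$ is the mean of the predictor values. The second term is zero when $x_{i} = \bar{x}$, which gives the minimum value 1/n, and it grows with the squared distance of $x_{i}$ from the center of the data.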

It is possible to express the fitted values, $\hat{y}$, in terms of the observed values, y, since

$\hat{y} = H y = X b .$

Hence, h_ii expresses how much impact the observation y_i has on the fitted value ${\hat{y}}_{i}$. A large value of h_ii indicates that the ith case is distant from the center of the X values of all n cases and therefore has more leverage. Leverage is an n-by-1 column vector in the Diagnostics table.
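One way to see this is to perturb a single response value and refit. Because $\hat{y} = H y$ is linear in y, adding an amount delta to y_i changes the ith fitted value by h_ii times delta. The sketch below checks this on synthetic data using base MATLAB only; the index i, the size of delta, and the variable names are arbitrary choices made for illustration.

```matlab
% Illustrative sketch with synthetic data (not from the toolbox).
rng(2);
n = 15;
X = [ones(n,1) randn(n,1)];      % intercept plus one predictor
y = X*[0.5; 1.5] + 0.2*randn(n,1);
H = X*((X'*X)\X');               % hat matrix

i = 4;                           % observation to perturb (arbitrary choice)
delta = 1;                       % amount added to y(i)
yPert = y;
yPert(i) = yPert(i) + delta;

yhat     = X*(X\y);              % fitted values before the perturbation
yhatPert = X*(X\yPert);          % fitted values after the perturbation

change = yhatPert(i) - yhat(i);      % observed change in the ith fitted value
gap = abs(change - H(i,i)*delta);    % should be near zero
```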

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can:

Display the Leverage vector by indexing into the property using dot notation
```
mdl.Diagnostics.Leverage
```
Plot the leverage for the values fitted by your model using
```
plotDiagnostics(mdl)
```
See the plotDiagnostics method of the LinearModel class for details.

Determine High Leverage Observations


This example shows how to compute Leverage values and assess high leverage observations. Load the sample data and define the response and independent variables.

```
load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));
```

Fit a linear regression model.

```
mdl = fitlm(X,y);
```

Plot the leverage values.

```
plotDiagnostics(mdl)
```

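For this example, the model has p = 5 coefficients (four predictors plus the intercept) and n = 100 observations, so the recommended threshold value is 2*5/100 = 0.1. The plot shows no indication of high leverage observations.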
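The threshold arithmetic can also be checked against the fitted model rather than read from the plot. The sketch below repeats the example setup so that it runs on its own, and uses the NumCoefficients and NumObservations properties of the fitted model. The variable names threshold and highLev are illustrative.

```matlab
load hospital
y = hospital.BloodPressure(:,1);
X = double(hospital(:,2:5));
mdl = fitlm(X,y);

p = mdl.NumCoefficients;         % number of coefficients, including the intercept
n = mdl.NumObservations;         % number of observations used in the fit
threshold = 2*p/n;               % rule-of-thumb cutoff, 2*p/n

% Indices of observations whose leverage exceeds the cutoff (empty if none)
highLev = find(mdl.Diagnostics.Leverage > threshold);
```

An empty highLev corresponds to a plot in which no point lies above the threshold.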
