predictorImportance

Estimates of predictor importance for regression tree

Syntax

imp = predictorImportance(tree)

Description

imp = predictorImportance(tree) computes estimates of predictor importance for tree by summing changes in the mean squared error due to splits on every predictor and dividing the sum by the number of branch nodes.

Input Arguments

tree

A regression tree created by fitrtree, or by the compact method.

Output Arguments

imp

A row vector with the same number of elements as the number of predictors (columns) in tree.X. The entries are the estimates of predictor importance, with 0 representing the smallest possible importance.

Examples

expand all

Estimate Predictor Importance

Open Live Script

Estimate the predictor importance for all predictor variables in the data.

Load the carsmall data set.

load carsmall

Grow a regression tree for MPG using Acceleration, Cylinders, Displacement, Horsepower, Model_Year, and Weight as predictors.

X = [Acceleration Cylinders Displacement Horsepower Model_Year Weight];
tree = fitrtree(X,MPG);

Estimate the predictor importance for all predictor variables.

imp = predictorImportance(tree)

imp = 1×6

    0.0647    0.1068    0.1155    0.1411    0.3348    2.6565

Weight, the last predictor, has the most impact on mileage. The predictor with the minimal impact on making predictions is the first variable, which is Acceleration.

Predictor Importance and Surrogate Splits

Open Live Script

Estimate the predictor importance for all variables in the data and where the regression tree contains surrogate splits.

Load the carsmall data set.

load carsmall

Grow a regression tree for MPG using Acceleration, Cylinders, Displacement, Horsepower, Model_Year, and Weight as predictors. Specify to identify surrogate splits.

X = [Acceleration Cylinders Displacement Horsepower Model_Year Weight];
tree = fitrtree(X,MPG,'Surrogate','on');

Estimate the predictor importance for all predictor variables.

imp = predictorImportance(tree)

imp = 1×6

    1.0449    2.4560    2.5570    2.5788    2.0832    2.8938

Comparing imp to the results in Estimate Predictor Importance, Weight still has the most impact on mileage, but Cylinders is the fourth most important predictor.

Unbiased Predictor Importance Estimates

Open Live Script

Load the carsmall data set. Consider a model that predicts the mean fuel economy of a car given its acceleration, number of cylinders, engine displacement, horsepower, manufacturer, model year, and weight. Consider Cylinders, Mfg, and Model_Year as categorical variables.

load carsmall
Cylinders = categorical(Cylinders);
Mfg = categorical(cellstr(Mfg));
Model_Year = categorical(Model_Year);
X = table(Acceleration,Cylinders,Displacement,Horsepower,Mfg,...
    Model_Year,Weight,MPG);

Display the number of categories represented in the categorical variables.

numCylinders = numel(categories(Cylinders))

numCylinders = 3

numMfg = numel(categories(Mfg))

numMfg = 28

numModelYear = numel(categories(Model_Year))

numModelYear = 3

Because there are 3 categories only in Cylinders and Model_Year, the standard CART, predictor-splitting algorithm prefers splitting a continuous predictor over these two variables.

Train a regression tree using the entire data set. To grow unbiased trees, specify usage of the curvature test for splitting predictors. Because there are missing values in the data, specify usage of surrogate splits.

Mdl = fitrtree(X,'MPG','PredictorSelection','curvature','Surrogate','on');

Estimate predictor importance values by summing changes in the risk due to splits on every predictor and dividing the sum by the number of branch nodes. Compare the estimates using a bar graph.

imp = predictorImportance(Mdl);

figure;
bar(imp);
title('Predictor Importance Estimates');
ylabel('Estimates');
xlabel('Predictors');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

In this case, Displacement is the most important predictor, followed by Horsepower.

More About

expand all

Predictor Importance

predictorImportance computes estimates of predictor importance for tree by summing changes in the mean squared error (MSE) due to splits on every predictor and dividing the sum by the number of branch nodes. If the tree is grown without surrogate splits, this sum is taken over best splits found at each branch node. If the tree is grown with surrogate splits, this sum is taken over all splits at each branch node including surrogate splits. imp has one element for each input predictor in the data used to train this tree. At each node, MSE is estimated as node error weighted by the node probability. Variable importance associated with this split is computed as the difference between MSE for the parent node and the total MSE for the two children.

Estimates of predictor importance do not depend on the order of predictors if you use surrogate splits, but do depend on the order if you do not use surrogate splits.

If you use surrogate splits, predictorImportance computes estimates before the tree is reduced by pruning or merging leaves. If you do not use surrogate splits, predictorImportance computes estimates after the tree is reduced by pruning or merging leaves. Therefore, reducing the tree by pruning affects the predictor importance for a tree grown without surrogate splits, and does not affect the predictor importance for a tree grown with surrogate splits.

Documentation

predictorImportance

Syntax

Description

Input Arguments

Output Arguments

Examples

Estimate Predictor Importance

Predictor Importance and Surrogate Splits

Unbiased Predictor Importance Estimates

More About

Predictor Importance

See Also

Topics

Statistics and Machine Learning Toolbox Documentation

Support