TreeBagger

Class: TreeBagger

Create bag of decision trees

Individual decision trees tend to overfit. Bootstrap-aggregated (bagged) decision trees combine the results of many decision trees, which reduces the effects of overfitting and improves generalization. TreeBagger grows the decision trees in the ensemble using bootstrap samples of the data. Also, TreeBagger selects a random subset of predictors to use at each decision split as in the random forest algorithm [1].

By default, TreeBagger bags classification trees. To bag regression trees instead, specify 'Method','regression'.

For regression problems, TreeBagger supports mean and quantile regression (that is, quantile regression forest [5]).

Syntax

Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName)
Mdl = TreeBagger(NumTrees,Tbl,formula)
Mdl = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(NumTrees,X,Y,Name,Value)

Description

Mdl = TreeBagger(NumTrees,Tbl,ResponseVarName) returns an ensemble of NumTrees bagged classification trees trained using the sample data in the table Tbl. ResponseVarName is the name of the response variable in Tbl.

Mdl = TreeBagger(NumTrees,Tbl,formula) returns an ensemble of bagged classification trees trained using the sample data in the table Tbl. formula is an explanatory model of the response and a subset of predictor variables in Tbl used to fit Mdl. Specify Formula using Wilkinson notation. For more information, see Wilkinson Notation.

Mdl = TreeBagger(NumTrees,Tbl,Y) returns an ensemble of classification trees using the predictor variables in table Tbl and class labels in vector Y.

Y is an array of response data. Elements of Y correspond to the rows of Tbl. For classification, Y is the set of true class labels. Labels can be any grouping variable, that is, a numeric or logical vector, character matrix, string array, cell array of character vectors, or categorical vector. TreeBagger converts labels to a cell array of character vectors. For regression, Y is a numeric vector. To grow regression trees, you must specify the name-value pair 'Method','regression'.

B = TreeBagger(NumTrees,X,Y) creates an ensemble B of NumTrees decision trees for predicting response Y as a function of predictors in the numeric matrix of training data, X. Each row in X represents an observation and each column represents a predictor or feature.

B = TreeBagger(NumTrees,X,Y,Name,Value) specifies optional parameter name-value pairs:

'InBagFraction' — Fraction of the input data to sample, with replacement, for growing each new tree. Default value is 1.
'Cost'

Square matrix C, where C(i,j) is the cost of classifying a point into class j if its true class is i (i.e., the rows correspond to the true class and the columns correspond to the predicted class). The order of the rows and columns of Cost corresponds to the order of the classes in the ClassNames property of the trained TreeBagger model B.

Alternatively, cost can be a structure S having two fields:

  • S.ClassNames containing the group names as a categorical variable, character array, string array, or cell array of character vectors

  • S.ClassificationCosts containing the cost matrix C

The default value is C(i,j) = 1 if i ~= j, and C(i,j) = 0 if i = j.

If Cost is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large penalty. For smaller sample sizes, this might cause a very low relative frequency of out-of-bag observations from the class that has a large penalty. Therefore, the estimated out-of-bag error is highly variable, and might be difficult to interpret.
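
For example, a minimal sketch (with hypothetical two-class labels 'bad' and 'good') showing both ways to specify misclassification costs:

rng(0)
X = rand(100,3);                                    % hypothetical predictors
Y = [repmat({'bad'},30,1); repmat({'good'},70,1)];  % hypothetical labels

% Cost as a square matrix. Rows are true classes, columns are predicted
% classes, in the order of the ClassNames property (here {'bad','good'}).
C = [0 5; 1 0];               % misclassifying a true 'bad' as 'good' costs 5
MdlMat = TreeBagger(20,X,Y,'Cost',C);

% Equivalent cost specified as a structure, which makes the class order explicit.
S.ClassNames = {'bad','good'};
S.ClassificationCosts = C;
MdlStruct = TreeBagger(20,X,Y,'Cost',S);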

'SampleWithReplacement' — 'on' to sample with replacement or 'off' to sample without replacement. If you sample without replacement, you need to set 'InBagFraction' to a value less than one. Default is 'on'.
'OOBPrediction' — 'on' to store information on which observations are out of bag for each tree. oobPredict can use this information to compute the predicted class probabilities for each tree in the ensemble. Default is 'off'.
'OOBPredictorImportance' — 'on' to store out-of-bag estimates of feature importance in the ensemble. Default is 'off'. Specifying 'on' also sets the 'OOBPrediction' value to 'on'. If an analysis of predictor importance is your goal, then also specify 'PredictorSelection','curvature' or 'PredictorSelection','interaction-curvature'. For more details, see fitctree or fitrtree.
'Method' — Either 'classification' or 'regression'. Regression requires a numeric Y.
'NumPredictorsToSample' — Number of variables to select at random for each decision split. Default is the square root of the number of variables for classification and one third of the number of variables for regression. Valid values are 'all' or a positive integer. Setting this argument to any valid value except 'all' invokes Breiman's random forest algorithm [1].
'NumPrint' — Number of training cycles (grown trees) after which TreeBagger displays a diagnostic message showing training progress. Default is no diagnostic messages.
'MinLeafSize' — Minimum number of observations per tree leaf. Default is 1 for classification and 5 for regression.
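
As a minimal sketch (hypothetical data), the following call combines several of the options above: it samples 75% of the observations without replacement for each tree, draws three predictors per split, requires at least five observations per leaf, and keeps out-of-bag information.

rng(0)
X = rand(200,6);                 % hypothetical predictors
Y = randn(200,1);                % numeric response, so grow regression trees
Mdl = TreeBagger(60,X,Y, ...
    'Method','regression', ...
    'InBagFraction',0.75, ...
    'SampleWithReplacement','off', ...   % requires InBagFraction < 1
    'NumPredictorsToSample',3, ...
    'MinLeafSize',5, ...
    'OOBPrediction','on', ...
    'NumPrint',20);                      % progress message every 20 trees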
'Options'

A structure that specifies options that govern the computation when growing the ensemble of decision trees. One option requests that the computation of decision trees on multiple bootstrap replicates uses multiple processors, if the Parallel Computing Toolbox™ is available. Two options specify the random number streams to use in selecting bootstrap replicates. You can create this argument with a call to statset. You can retrieve values of the individual fields with a call to statget. A sketch of a typical parallel configuration follows the list of parameters. Applicable statset parameters are:

  • 'UseParallel' — If true and if a parpool of the Parallel Computing Toolbox is open, compute decision trees drawn on separate bootstrap replicates in parallel. If the Parallel Computing Toolbox is not installed, or a parpool is not open, computation occurs in serial mode. Default is false, or serial computation.

    For dual-core systems and above, TreeBagger parallelizes training using Intel® Threading Building Blocks (TBB). Therefore, using the 'UseParallel' option is not helpful on a single computer. Use this option on a cluster. For details on Intel TBB, see https://software.intel.com/en-us/intel-tbb.

  • 'UseSubstreams' — If true, select each bootstrap replicate using a separate substream of the random number generator (also known as a stream). This option is available only for RandStream types that support substreams: 'mlfg6331_64' or 'mrg32k3a'. Default is false: do not use a different substream to compute each bootstrap replicate.

  • 'Streams' — A RandStream object or cell array of such objects. If you do not specify Streams, TreeBagger uses the default stream or streams. If you choose to specify Streams, use a single object except when both of the following are true:

    • UseParallel is true

    • UseSubstreams is false

    In that case, use a cell array the same size as the Parallel pool.
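
For example, a sketch of reproducible training with the 'Options' argument (hypothetical data; parallel execution requires Parallel Computing Toolbox and an open parpool, otherwise the computation runs serially). The stream type 'mlfg6331_64' supports substreams.

rng(0)
X = rand(500,10);                 % hypothetical predictors
Y = randn(500,1);                 % hypothetical numeric response

stream = RandStream('mlfg6331_64');          % a type that supports substreams
opts = statset('UseParallel',true, ...
               'UseSubstreams',true, ...
               'Streams',stream);
Mdl = TreeBagger(100,X,Y,'Method','regression','Options',opts);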

'Prior'

Prior probabilities for each class. Specify as one of:

  • A character vector or string scalar:

    • 'Empirical' determines class probabilities from class frequencies in Y. If you pass observation weights, they are used to compute the class probabilities. This is the default.

    • 'Uniform' sets all class probabilities equal.

  • A vector (one scalar value for each class). The order of the elements of Prior corresponds to the order of the classes in the ClassNames property of the trained TreeBagger model B.

  • A structure S with two fields:

    • S.ClassNames containing the class names as a categorical variable, character array, string array, or cell array of character vectors

    • S.ClassProbs containing a vector of corresponding probabilities

If you set values for both Weights and Prior, the weights are renormalized to add up to the value of the prior probability in the respective class.

If Prior is highly skewed, then, for in-bag samples, the software oversamples unique observations from the class that has a large prior probability. For smaller sample sizes, this might cause a very low relative frequency of out-of-bag observations from the class that has a large prior probability. Therefore, the estimated out-of-bag error is highly variable, and might be difficult to interpret.
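
For example, a sketch (with hypothetical imbalanced classes 'fraud' and 'ok') showing uniform priors and priors specified as a structure:

rng(0)
X = rand(100,4);                                    % hypothetical predictors
Y = [repmat({'fraud'},10,1); repmat({'ok'},90,1)];  % 10/90 class imbalance

% Uniform priors ignore the class imbalance in Y.
MdlUnif = TreeBagger(30,X,Y,'Prior','Uniform');

% Priors as a structure, which makes the class order explicit.
S.ClassNames = {'fraud','ok'};
S.ClassProbs = [0.3 0.7];
MdlStruct = TreeBagger(30,X,Y,'Prior',S);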

'PredictorNames'

Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a string array or cell array of unique character vectors. The functionality of 'PredictorNames' depends on the way you supply the training data.

  • If you supply X and Y, then you can use 'PredictorNames' to give the predictor variables in X names.

    • The order of the names in PredictorNames must correspond to the column order of X. That is, PredictorNames{1} is the name of X(:,1), PredictorNames{2} is the name of X(:,2), and so on. Also, size(X,2) and numel(PredictorNames) must be equal.

    • By default, PredictorNames is {'x1','x2',...}.

  • If you supply Tbl, then you can use 'PredictorNames' to choose which predictor variables to use in training. That is, TreeBagger uses the predictor variables in PredictorNames and the response only in training.

    • PredictorNames must be a subset of Tbl.Properties.VariableNames and cannot include the name of the response variable.

    • By default, PredictorNames contains the names of all predictor variables.

    • It is good practice to specify the predictors for training using either 'PredictorNames' or formula, but not both. A sketch of both uses follows this list.
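
The following sketch illustrates both uses of 'PredictorNames', based on the carsmall data set used elsewhere on this page.

load carsmall

% With a numeric matrix X, the names label the columns of X.
X = [Horsepower Weight];
MdlMat = TreeBagger(30,X,MPG,'Method','regression', ...
    'PredictorNames',{'Horsepower','Weight'});

% With a table, the names select which variables to use in training.
Tbl = table(Acceleration,Horsepower,Weight,MPG);
MdlTbl = TreeBagger(30,Tbl,'MPG','Method','regression', ...
    'PredictorNames',{'Horsepower','Weight'});    % Acceleration is ignored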

'CategoricalPredictors'

Categorical predictors list, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following. A short sketch follows the list.

  • A numeric vector with indices from 1 to p, where p is the number of columns of X.

  • A logical vector of length p, where a true entry means that the corresponding column of X is a categorical variable.

  • A string array or cell array of character vectors, where each element in the array is the name of a predictor variable. The names must match the entries in PredictorNames.

  • A character matrix, where each row of the matrix is a name of a predictor variable. The names must match the entries in PredictorNames. Pad the names with extra blanks so each row of the character matrix has the same length.

  • 'all', meaning all predictors are categorical.
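
For example, a sketch (using the carsmall data set) that marks Cylinders and Model_Year as categorical by name and, equivalently, by column index:

load carsmall
Tbl = table(Acceleration,Cylinders,Model_Year,Weight,MPG);
Mdl = TreeBagger(50,Tbl,'MPG','Method','regression', ...
    'CategoricalPredictors',{'Cylinders','Model_Year'});

% Equivalent specification by column index for matrix input.
X = [Acceleration Cylinders Model_Year Weight];
MdlIdx = TreeBagger(50,X,MPG,'Method','regression', ...
    'CategoricalPredictors',[2 3]);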

'ChunkSize'

Chunk size, specified as the comma-separated pair consisting of 'ChunkSize' and a positive integer. The chunk size specifies the number of observations in each chunk of data. The default value is 50000.

Note

This option only applies when using TreeBagger on tall arrays. See Extended Capabilities for more information.
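
As a sketch, assuming tx and ty are a tall predictor matrix and a tall response that you have already created (as in the tall array example below), a smaller chunk size of 10,000 observations would be specified as:

tMdl = TreeBagger(20,tx,ty,'ChunkSize',10000);  % tx, ty are hypothetical tall arrays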

In addition to the optional arguments above, TreeBagger accepts several optional fitctree and fitrtree arguments, for example 'Surrogate' and 'PredictorSelection' (both used in the examples below).

Examples

Load Fisher's iris data set.

load fisheriris

Train an ensemble of bagged classification trees using the entire data set. Specify 50 weak learners. Store which observations are out of bag for each tree.

rng(1); % For reproducibility
Mdl = TreeBagger(50,meas,species,'OOBPrediction','On',...
    'Method','classification')
Mdl = 
  TreeBagger
Ensemble with 50 bagged decision trees:
                    Training X:              [150x4]
                    Training Y:              [150x1]
                        Method:       classification
                 NumPredictors:                    4
         NumPredictorsToSample:                    2
                   MinLeafSize:                    1
                 InBagFraction:                    1
         SampleWithReplacement:                    1
          ComputeOOBPrediction:                    1
 ComputeOOBPredictorImportance:                    0
                     Proximity:                   []
                    ClassNames:        'setosa'    'versicolor'     'virginica'

  Properties, Methods

Mdl is a TreeBagger ensemble.

Mdl.Trees stores a 50-by-1 cell vector of the trained classification trees (CompactClassificationTree model objects) that compose the ensemble.

Plot a graph of the first trained classification tree.

view(Mdl.Trees{1},'Mode','graph')

By default, TreeBagger grows deep trees.

Mdl.OOBIndices stores the out-of-bag indices as a matrix of logical values.

Plot the out-of-bag error over the number of grown classification trees.

figure;
oobErrorBaggedEnsemble = oobError(Mdl);
plot(oobErrorBaggedEnsemble)
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';

The out-of-bag error decreases with the number of grown trees.

To label out-of-bag observations, pass Mdl to oobPredict.
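
For example, a short sketch that continues from Mdl above and compares the out-of-bag predictions with the true labels:

oobLabels = oobPredict(Mdl);                          % out-of-bag predicted labels
oobAccuracy = sum(strcmp(oobLabels,species))/numel(species)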

Load the carsmall data set. Consider a model that predicts the fuel economy of a car given its engine displacement.

load carsmall

Train an ensemble of bagged regression trees using the entire data set. Specify 100 weak learners.

rng(1); % For reproducibility
Mdl = TreeBagger(100,Displacement,MPG,'Method','regression');

Mdl is a TreeBagger ensemble.

Using a trained bag of regression trees, you can estimate conditional mean responses or perform quantile regression to predict conditional quantiles.

For ten equally-spaced engine displacements between the minimum and maximum in-sample displacement, predict conditional mean responses and conditional quartiles.

predX = linspace(min(Displacement),max(Displacement),10)';
mpgMean = predict(Mdl,predX);
mpgQuartiles = quantilePredict(Mdl,predX,'Quantile',[0.25,0.5,0.75]);

Plot the observations, the estimated mean responses, and the estimated quartiles in the same figure.

figure;
plot(Displacement,MPG,'o');
hold on
plot(predX,mpgMean);
plot(predX,mpgQuartiles);
ylabel('Fuel economy');
xlabel('Engine displacement');
legend('Data','Mean Response','First quartile','Median','Third quartile');

Load the carsmall data set. Consider a model that predicts the mean fuel economy of a car given its acceleration, number of cylinders, engine displacement, horsepower, manufacturer, model year, and weight. Consider Cylinders, Mfg, and Model_Year as categorical variables.

load carsmall
Cylinders = categorical(Cylinders);
Mfg = categorical(cellstr(Mfg));
Model_Year = categorical(Model_Year);
X = table(Acceleration,Cylinders,Displacement,Horsepower,Mfg,...
    Model_Year,Weight,MPG);
rng('default'); % For reproducibility

Display the number of categories represented in the categorical variables.

numCylinders = numel(categories(Cylinders))
numCylinders = 3
numMfg = numel(categories(Mfg))
numMfg = 28
numModelYear = numel(categories(Model_Year))
numModelYear = 3

Because Cylinders and Model_Year each contain only 3 categories, the standard CART predictor-splitting algorithm prefers splitting a continuous predictor over these two variables.

Train a random forest of 200 regression trees using the entire data set. To grow unbiased trees, specify usage of the curvature test for splitting predictors. Because there are missing values in the data, specify usage of surrogate splits. Store the out-of-bag information for predictor importance estimation.

Mdl = TreeBagger(200,X,'MPG','Method','regression','Surrogate','on',...
    'PredictorSelection','curvature','OOBPredictorImportance','on');

TreeBagger stores predictor importance estimates in the property OOBPermutedPredictorDeltaError. Compare the estimates using a bar graph.

imp = Mdl.OOBPermutedPredictorDeltaError;

figure;
bar(imp);
title('Curvature Test');
ylabel('Predictor importance estimates');
xlabel('Predictors');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

In this case, Model_Year is the most important predictor, followed by Weight.

Compare imp to the predictor importance estimates computed from a random forest that grows trees using standard CART.

MdlCART = TreeBagger(200,X,'MPG','Method','regression','Surrogate','on',...
    'OOBPredictorImportance','on');

impCART = MdlCART.OOBPermutedPredictorDeltaError;

figure;
bar(impCART);
title('Standard CART');
ylabel('Predictor importance estimates');
xlabel('Predictors');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

In this case, Weight, a continuous predictor, is the most important. The next two most important predictors are Model_Year, followed closely by Horsepower, which is also a continuous predictor.

Train an ensemble of bagged classification trees for observations in a tall array, and find the misclassification probability of each tree in the model for weighted observations. The sample data set airlinesmall.csv is a large data set that contains a tabular file of airline flight data.

When you perform calculations on tall arrays, MATLAB® uses either a parallel pool (default if you have Parallel Computing Toolbox™) or the local MATLAB session. To run the example using the local MATLAB session when you have Parallel Computing Toolbox, change the global execution environment by using the mapreducer function.

mapreducer(0)

Create a datastore that references the location of the folder containing the data set. Select a subset of the variables to work with, and treat 'NA' values as missing data so that datastore replaces them with NaN values. Create a tall table that contains the data in the datastore.

ds = datastore('airlinesmall.csv');
ds.SelectedVariableNames = {'Month','DayofMonth','DayOfWeek',...
                            'DepTime','ArrDelay','Distance','DepDelay'};
ds.TreatAsMissing = 'NA';
tt  = tall(ds) % Tall table
tt =

  Mx7 tall table

    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    Distance    DepDelay
    _____    __________    _________    _______    ________    ________    ________

     10          21            3          642          8         308          12   
     10          26            1         1021          8         296           1   
     10          23            5         2055         21         480          20   
     10          23            5         1332         13         296          12   
     10          22            4          629          4         373          -1   
     10          28            3         1446         59         308          63   
     10           8            4          928          3         447          -2   
     10          10            6          859         11         954          -1   
      :          :             :           :          :           :           :
      :          :             :           :          :           :           :

Determine the flights that are late by 10 minutes or more by defining a logical variable that is true for a late flight. This variable contains the class labels. A preview of this variable includes the first few rows.

Y = tt.DepDelay > 10 % Class labels
Y =

  Mx1 tall logical array

   1
   0
   1
   1
   0
   1
   0
   0
   :
   :

Create a tall array for the predictor data.

X = tt{:,1:end-1} % Predictor data
X =

  Mx6 tall double matrix

          10          21           3         642           8         308
          10          26           1        1021           8         296
          10          23           5        2055          21         480
          10          23           5        1332          13         296
          10          22           4         629           4         373
          10          28           3        1446          59         308
          10           8           4         928           3         447
          10          10           6         859          11         954
          :           :            :          :           :           :
          :           :            :          :           :           :

Create a tall array for the observation weights by arbitrarily assigning a weight of 2 to the observations in class 1 (and 1 otherwise).

W = Y+1; % Weights

Remove rows in X, Y, and W that contain missing data.

R = rmmissing([X Y W]); % Data with missing entries removed
X = R(:,1:end-2); 
Y = R(:,end-1); 
W = R(:,end);

Train an ensemble of 20 bagged decision trees using the entire data set. Specify a weight vector and uniform prior probabilities. For reproducibility, set the seeds of the random number generators using rng and tallrng. The results can vary depending on the number of workers and the execution environment for the tall arrays. For details, see Control Where Your Code Runs.

rng('default') 
tallrng('default')
tMdl = TreeBagger(20,X,Y,'Weights',W,'Prior','Uniform')
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.2 sec
Evaluation completed in 1.5 sec
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2.8 sec
Evaluation completed in 3 sec
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 6.3 sec
Evaluation completed in 6.4 sec
tMdl = 
  CompactTreeBagger
Ensemble with 20 bagged decision trees:
              Method:       classification
       NumPredictors:                    6
          ClassNames: '0' '1'

  Properties, Methods

tMdl is a CompactTreeBagger ensemble with 20 bagged decision trees.

Calculate the misclassification probability of each tree in the model. Attribute a weight contained in the vector W to each observation by using the 'Weights' name-value pair argument.

terr = error(tMdl,X,Y,'Weights',W)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 6.7 sec
Evaluation completed in 6.8 sec
terr = 20×1

    0.1420
    0.1214
    0.1115
    0.1078
    0.1037
    0.1027
    0.1005
    0.0997
    0.0981
    0.0983
      ⋮

Find the average misclassification probability for the ensemble of decision trees.

avg_terr = mean(terr)
avg_terr = 0.1022

Tips

  • Avoid large estimated out-of-bag error variances by setting a more balanced misclassification cost matrix or a less skewed prior probability vector.

  • The Trees property of B stores a cell array of B.NumTrees CompactClassificationTree or CompactRegressionTree model objects. For a textual or graphical display of tree t in the cell array, enter

    view(B.Trees{t})

  • Standard CART tends to select split predictors containing many distinct values, e.g., continuous variables, over those containing few distinct values, e.g., categorical variables [4]. Consider specifying the curvature or interaction test if any of the following are true:

    • If there are predictors that have relatively fewer distinct values than other predictors, for example, if the predictor data set is heterogeneous.

    • If an analysis of predictor importance is your goal. TreeBagger stores predictor importance estimates in the OOBPermutedPredictorDeltaError property of Mdl.

    For more information on predictor selection, see PredictorSelection for classification trees or PredictorSelection for regression trees.

Algorithms

  • TreeBagger generates in-bag samples by oversampling classes with large misclassification costs and undersampling classes with small misclassification costs. Consequently, out-of-bag samples have fewer observations from classes with large misclassification costs and more observations from classes with small misclassification costs. If you train a classification ensemble using a small data set and a highly skewed cost matrix, then the number of out-of-bag observations per class might be very low. Therefore, the estimated out-of-bag error might have a large variance and might be difficult to interpret. The same phenomenon can occur for classes with large prior probabilities.

  • For details on selecting split predictors and node-splitting algorithms when growing decision trees, see Algorithms for classification trees and Algorithms for regression trees.

Alternative Functionality

Statistics and Machine Learning Toolbox™ offers three objects for bagging and random forest:

  • TreeBagger, created by the TreeBagger function

  • ClassificationBaggedEnsemble, created by the fitcensemble function

  • RegressionBaggedEnsemble, created by the fitrensemble function

For details about the differences between TreeBagger and bagged ensembles (ClassificationBaggedEnsemble and RegressionBaggedEnsemble), see Comparison of TreeBagger and Bagged Ensembles.

References

[1] Breiman, L. “Random Forests.” Machine Learning, Vol. 45, 2001, pp. 5–32.

[2] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.

[3] Loh, W. Y. “Regression Trees with Unbiased Variable Selection and Interaction Detection.” Statistica Sinica, Vol. 12, 2002, pp. 361–386.

[4] Loh, W. Y., and Y. S. Shih. “Split Selection Methods for Classification Trees.” Statistica Sinica, Vol. 7, 1997, pp. 815–840.

[5] Meinshausen, N. “Quantile Regression Forests.” Journal of Machine Learning Research, Vol. 7, 2006, pp. 983–999.

Extended Capabilities

  • Tall Arrays — TreeBagger supports tall arrays for out-of-memory data. See the 'ChunkSize' argument and the tall array example above.

Introduced in R2009a