ClassificationTree class

Superclasses: CompactClassificationTree

Binary decision tree for multiclass classification

Description

A ClassificationTree object represents a decision tree with binary splits for classification. An object of this class can predict responses for new data using the predict method. The object contains the data used for training, so it can also compute resubstitution predictions.

Construction

Create a ClassificationTree object by using fitctree.
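
As a minimal sketch of this workflow, assuming the ionosphere sample data set used in the examples on this page, the following trains a tree, predicts labels for a few rows, and computes resubstitution predictions from the stored training data. The variable names `tree`, `labels`, and `resubLabels` are illustrative only.

```matlab
load ionosphere                      % X: numeric predictor matrix, Y: cell array of class labels
tree = fitctree(X,Y);                % returns a ClassificationTree object
labels = predict(tree,X(1:5,:));     % predicted labels for the first five rows
resubLabels = resubPredict(tree);    % predictions for the training data stored in the object
```

Because the object stores `X` and `Y`, `resubPredict` needs no data argument.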

Properties

`BinEdges`	Bin edges for numeric predictors, specified as a cell array of p numeric vectors, where p is the number of predictors. Each vector includes the bin edges for a numeric predictor. The element in the cell array for a categorical predictor is empty because the software does not bin categorical predictors. The software bins numeric predictors only if you specify the `'NumBins'` name-value pair argument as a positive integer scalar when training a model with tree learners. The `BinEdges` property is empty if the `'NumBins'` value is empty (default). You can reproduce the binned predictor data `Xbinned` by using the `BinEdges` property of the trained model `mdl`.

X = mdl.X; % Predictor data
Xbinned = zeros(size(X));
edges = mdl.BinEdges;
% Find indices of binned predictors.
idxNumeric = find(~cellfun(@isempty,edges));
if iscolumn(idxNumeric)
    idxNumeric = idxNumeric';
end
for j = idxNumeric
    x = X(:,j);
    % Convert x to array if x is a table.
    if istable(x)
        x = table2array(x);
    end
    % Group x into bins by using the discretize function.
    xbinned = discretize(x,[-inf; edges{j}; inf]);
    Xbinned(:,j) = xbinned;
end

`Xbinned` contains the bin indices, ranging from 1 to the number of bins, for numeric predictors. `Xbinned` values are 0 for categorical predictors. If `X` contains `NaN`s, then the corresponding `Xbinned` values are `NaN`s.
`CategoricalPredictors`	Categorical predictor indices, specified as a vector of positive integers. `CategoricalPredictors` contains index values corresponding to the columns of the predictor data that contain categorical predictors. If none of the predictors are categorical, then this property is empty (`[]`).
`CategoricalSplit`	An n-by-2 cell array, where n is the number of categorical splits in `tree`. Each row in `CategoricalSplit` gives left and right values for a categorical split. For each branch node with categorical split `j` based on a categorical predictor variable `z`, the left child is chosen if `z` is in `CategoricalSplit(j,1)` and the right child is chosen if `z` is in `CategoricalSplit(j,2)`. The splits are in the same order as nodes of the tree. Find the nodes for these splits by selecting `'categorical'` cuts from top to bottom in the `CutType` property.
`Children`	An n-by-2 array containing the numbers of the child nodes for each node in `tree`, where n is the number of nodes. Leaf nodes have child node `0`.
`ClassCount`	An n-by-k array of class counts for the nodes in `tree`, where n is the number of nodes and k is the number of classes. For any node number `i`, the class counts `ClassCount(i,:)` are counts of observations (from the data used in fitting the tree) from each class satisfying the conditions for node `i`.
`ClassNames`	List of the elements in `Y` with duplicates removed. `ClassNames` can be a categorical array, cell array of character vectors, character array, logical vector, or a numeric vector. `ClassNames` has the same data type as the data in the argument `Y`. (The software treats string arrays as cell arrays of character vectors.)
`ClassProbability`	An n-by-k array of class probabilities for the nodes in `tree`, where n is the number of nodes and k is the number of classes. For any node number `i`, the class probabilities `ClassProbability(i,:)` are the estimated probabilities for each class for a point satisfying the conditions for node `i`.
`Cost`	Square matrix, where `Cost(i,j)` is the cost of classifying a point into class `j` if its true class is `i` (the rows correspond to the true class and the columns correspond to the predicted class). The order of the rows and columns of `Cost` corresponds to the order of the classes in `ClassNames`. The number of rows and columns in `Cost` is the number of unique classes in the response. This property is read-only.
`CutCategories`	An n-by-2 cell array of the categories used at branches in `tree`, where n is the number of nodes. For each branch node `i` based on a categorical predictor variable `X`, the left child is chosen if `X` is among the categories listed in `CutCategories{i,1}`, and the right child is chosen if `X` is among those listed in `CutCategories{i,2}`. Both columns of `CutCategories` are empty for branch nodes based on continuous predictors and for leaf nodes. `CutPoint` contains the cut points for `'continuous'` cuts, and `CutCategories` contains the set of categories.
`CutPoint`	An n-element vector of the values used as cut points in `tree`, where n is the number of nodes. For each branch node `i` based on a continuous predictor variable `X`, the left child is chosen if `X<CutPoint(i)` and the right child is chosen if `X>=CutPoint(i)`. `CutPoint` is `NaN` for branch nodes based on categorical predictors and for leaf nodes. `CutPoint` contains the cut points for `'continuous'` cuts, and `CutCategories` contains the set of categories.
`CutType`	An n-element cell array indicating the type of cut at each node in `tree`, where n is the number of nodes. For each node `i`, `CutType{i}` is: `'continuous'` — If the cut is defined in the form `X < v` for a variable `X` and cut point `v`. `'categorical'` — If the cut is defined by whether a variable `X` takes a value in a set of categories. `''` — If `i` is a leaf node. `CutPoint` contains the cut points for `'continuous'` cuts, and `CutCategories` contains the set of categories.
`CutPredictor`	An n-element cell array of the names of the variables used for branching in each node in `tree`, where n is the number of nodes. These variables are sometimes known as cut variables. For leaf nodes, `CutPredictor` contains an empty character vector. `CutPoint` contains the cut points for `'continuous'` cuts, and `CutCategories` contains the set of categories.
`CutPredictorIndex`	An n-element array of numeric indices for the variables used for branching in each node in `tree`, where n is the number of nodes. For more information, see `CutPredictor`.
`ExpandedPredictorNames`	Expanded predictor names, stored as a cell array of character vectors. If the model uses encoding for categorical variables, then `ExpandedPredictorNames` includes the names that describe the expanded variables. Otherwise, `ExpandedPredictorNames` is the same as `PredictorNames`.
`HyperparameterOptimizationResults`	Description of the cross-validation optimization of hyperparameters, stored as a `BayesianOptimization` object or a table of hyperparameters and associated values. Nonempty when the `OptimizeHyperparameters` name-value pair is nonempty at creation. The value depends on the setting of the `HyperparameterOptimizationOptions` name-value pair at creation: for `'bayesopt'` (default), it is an object of class `BayesianOptimization`; for `'gridsearch'` or `'randomsearch'`, it is a table of hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst).
`IsBranchNode`	An n-element logical vector that is `true` for each branch node and `false` for each leaf node of `tree`.
`ModelParameters`	Parameters used in training `tree`. To display all parameter values, enter `tree.ModelParameters`. To access a particular parameter, use dot notation.
`NumObservations`	Number of observations in the training data, a numeric scalar. `NumObservations` can be less than the number of rows of input data `X` when there are missing values in `X` or response `Y`.
`NodeClass`	An n-element cell array with the names of the most probable classes in each node of `tree`, where n is the number of nodes in the tree. Every element of this array is a character vector equal to one of the class names in `ClassNames`.
`NodeError`	An n-element vector of the errors of the nodes in `tree`, where n is the number of nodes. `NodeError(i)` is the misclassification probability for node `i`.
`NodeProbability`	An n-element vector of the probabilities of the nodes in `tree`, where n is the number of nodes. The probability of a node is computed as the proportion of observations from the original data that satisfy the conditions for the node. This proportion is adjusted for any prior probabilities assigned to each class.
`NodeRisk`	An n-element vector of the risk of the nodes in the tree, where n is the number of nodes. The risk for each node is the measure of impurity (Gini index or deviance) for this node weighted by the node probability. If the tree is grown by twoing, the risk for each node is zero.
`NodeSize`	An n-element vector of the sizes of the nodes in `tree`, where n is the number of nodes. The size of a node is defined as the number of observations from the data used to create the tree that satisfy the conditions for the node.
`NumNodes`	The number of nodes in `tree`.
`Parent`	An n-element vector containing the number of the parent node for each node in `tree`, where n is the number of nodes. The parent of the root node is `0`.
`PredictorNames`	Cell array of character vectors containing the predictor names, in the order in which they appear in `X`.
`Prior`	Numeric vector of prior probabilities for each class. The order of the elements of `Prior` corresponds to the order of the classes in `ClassNames`. The number of elements of `Prior` is the number of unique classes in the response. This property is read-only.
`PruneAlpha`	Numeric vector with one element per pruning level. If the pruning level ranges from 0 to M, then `PruneAlpha` has M + 1 elements sorted in ascending order. `PruneAlpha(1)` is for pruning level 0 (no pruning), `PruneAlpha(2)` is for pruning level 1, and so on.
`PruneList`	An n-element numeric vector with the pruning levels in each node of `tree`, where n is the number of nodes. The pruning levels range from 0 (no pruning) to M, where M is the distance between the deepest leaf and the root node.
`ResponseName`	A character vector that specifies the name of the response variable (`Y`).
`RowsUsed`	An n-element logical vector indicating which rows of the original predictor data (`X`) were used in fitting. If the software uses all rows of `X`, then `RowsUsed` is an empty array (`[]`).
`ScoreTransform`	Function handle for transforming predicted classification scores, or character vector representing a built-in transformation function. `none` means no transformation, or `@(x)x`. To change the score transformation function to, for example, `function`, use dot notation. For an available built-in function (see `fitctree`), enter `tree.ScoreTransform = 'function';`. To set a function handle for an available function, or for a function you define yourself, enter `tree.ScoreTransform = @function;`.
`SurrogateCutCategories`	An n-element cell array of the categories used for surrogate splits in `tree`, where n is the number of nodes in `tree`. For each node `k`, `SurrogateCutCategories{k}` is a cell array. The length of `SurrogateCutCategories{k}` is equal to the number of surrogate predictors found at this node. Every element of `SurrogateCutCategories{k}` is either an empty character vector for a continuous surrogate predictor, or is a two-element cell array with categories for a categorical surrogate predictor. The first element of this two-element cell array lists categories assigned to the left child by this surrogate split, and the second element of this two-element cell array lists categories assigned to the right child by this surrogate split. The order of the surrogate split variables at each node is matched to the order of variables in `SurrogateCutPredictor`. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, `SurrogateCutCategories` contains an empty cell.
`SurrogateCutFlip`	An n-element cell array of the numeric cut assignments used for surrogate splits in `tree`, where n is the number of nodes in `tree`. For each node `k`, `SurrogateCutFlip{k}` is a numeric vector. The length of `SurrogateCutFlip{k}` is equal to the number of surrogate predictors found at this node. Every element of `SurrogateCutFlip{k}` is either zero for a categorical surrogate predictor, or a numeric cut assignment for a continuous surrogate predictor. The numeric cut assignment can be either –1 or +1. For every surrogate split with a numeric cut C based on a continuous predictor variable Z, the left child is chosen if Z<C and the cut assignment for this surrogate split is +1, or if Z≥C and the cut assignment for this surrogate split is –1. Similarly, the right child is chosen if Z≥C and the cut assignment for this surrogate split is +1, or if Z<C and the cut assignment for this surrogate split is –1. The order of the surrogate split variables at each node is matched to the order of variables in `SurrogateCutPredictor`. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, `SurrogateCutFlip` contains an empty array.
`SurrogateCutPoint`	An n-element cell array of the numeric values used for surrogate splits in `tree`, where n is the number of nodes in `tree`. For each node `k`, `SurrogateCutPoint{k}` is a numeric vector. The length of `SurrogateCutPoint{k}` is equal to the number of surrogate predictors found at this node. Every element of `SurrogateCutPoint{k}` is either `NaN` for a categorical surrogate predictor, or a numeric cut for a continuous surrogate predictor. For every surrogate split with a numeric cut C based on a continuous predictor variable Z, the left child is chosen if Z<C and `SurrogateCutFlip` for this surrogate split is +1, or if Z≥C and `SurrogateCutFlip` for this surrogate split is –1. Similarly, the right child is chosen if Z≥C and `SurrogateCutFlip` for this surrogate split is +1, or if Z<C and `SurrogateCutFlip` for this surrogate split is –1. The order of the surrogate split variables at each node is matched to the order of variables returned by `SurrogateCutPredictor`. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, `SurrogateCutPoint` contains an empty cell.
`SurrogateCutType`	An n-element cell array indicating types of surrogate splits at each node in `tree`, where n is the number of nodes in `tree`. For each node `k`, `SurrogateCutType{k}` is a cell array with the types of the surrogate split variables at this node. The variables are sorted in descending order by the predictive measure of association with the optimal predictor, and only variables with a positive predictive measure are included. The order of the surrogate split variables at each node is matched to the order of variables in `SurrogateCutPredictor`. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, `SurrogateCutType` contains an empty cell. A surrogate split type can be either `'continuous'` if the cut is defined in the form `Z < V` for a variable `Z` and cut point `V` or `'categorical'` if the cut is defined by whether `Z` takes a value in a set of categories.
`SurrogateCutPredictor`	An n-element cell array of the names of the variables used for surrogate splits in each node in `tree`, where n is the number of nodes in `tree`. Every element of `SurrogateCutPredictor` is a cell array with the names of the surrogate split variables at this node. The variables are sorted in descending order by the predictive measure of association with the optimal predictor, and only variables with a positive predictive measure are included. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, `SurrogateCutPredictor` contains an empty cell.
`SurrogatePredictorAssociation`	An n-element cell array of the predictive measures of association for surrogate splits in `tree`, where n is the number of nodes in `tree`. For each node `k`, `SurrogatePredictorAssociation{k}` is a numeric vector. The length of `SurrogatePredictorAssociation{k}` is equal to the number of surrogate predictors found at this node. Every element of `SurrogatePredictorAssociation{k}` gives the predictive measure of association between the optimal split and this surrogate split. The order of the surrogate split variables at each node is the order of variables in `SurrogateCutPredictor`. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, `SurrogatePredictorAssociation` contains an empty cell.
`W`	The scaled `weights`, a vector of length n, where n is the number of rows in `X`.
`X`	A matrix or table of predictor values. Each column of `X` represents one variable, and each row represents one observation.
`Y`	A categorical array, cell array of character vectors, character array, logical vector, or a numeric vector. Each row of `Y` represents the classification of the corresponding row of `X`.
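
The node-level properties above describe the tree completely, so a prediction can be traced by hand. The following is a rough sketch rather than the toolbox's own implementation. It assumes the fisheriris sample data set (variables `meas` and `species`), all-continuous predictors, no missing values, and default misclassification costs, so that the most probable class at the leaf is also the predicted label. The names `x`, `node`, and `manualLabel` are illustrative.

```matlab
load fisheriris                          % meas: 150-by-4 numeric matrix, species: cell array of labels
tree = fitctree(meas,species);

x = meas(1,:);                           % one observation to trace through the tree
node = 1;                                % node 1 is the root
while tree.IsBranchNode(node)
    j = tree.CutPredictorIndex(node);    % column of X used for the split at this node
    if x(j) < tree.CutPoint(node)
        node = tree.Children(node,1);    % left child when X < CutPoint
    else
        node = tree.Children(node,2);    % right child when X >= CutPoint
    end
end
manualLabel = tree.NodeClass{node};      % most probable class at the leaf reached
predictedLabel = predict(tree,x);        % label from the predict method, for comparison
```

A tree with categorical splits would also need to consult `CutType` and `CutCategories` at each branch node, which this sketch omits.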

Object Functions

`compact`	Compact tree
`crossval`	Cross-validated decision tree
`cvloss`	Classification error by cross validation
`edge`	Classification edge
`loss`	Classification error
`margin`	Classification margins
`partialDependence`	Compute partial dependence
`plotPartialDependence`	Create partial dependence plot (PDP) and individual conditional expectation (ICE) plots
`predict`	Predict labels using classification tree
`predictorImportance`	Estimates of predictor importance for classification tree
`prune`	Produce sequence of classification subtrees by pruning
`resubEdge`	Classification edge by resubstitution
`resubLoss`	Classification error by resubstitution
`resubMargin`	Classification margins by resubstitution
`resubPredict`	Predict resubstitution labels of classification tree
`surrogateAssociation`	Mean predictive measure of association for surrogate splits in classification tree
`view`	View classification tree
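
A few of these functions can be combined as in the following sketch, which assumes the ionosphere sample data set and uses illustrative variable names. It computes the resubstitution error, estimates predictor importance, and prunes the tree by one level.

```matlab
load ionosphere                          % X: numeric predictor matrix, Y: cell array of class labels
tree = fitctree(X,Y);

err = resubLoss(tree);                   % classification error on the training data
imp = predictorImportance(tree);         % one importance estimate per predictor
pruned = prune(tree,'Level',1);          % subtree pruned to level 1
view(pruned)                             % text description of the pruned tree
```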

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects.

Examples

Grow a Classification Tree

Grow a classification tree using the ionosphere data set.

load ionosphere
tc = fitctree(X,Y)

tc = 
  ClassificationTree
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: {'b'  'g'}
           ScoreTransform: 'none'
          NumObservations: 351


  Properties, Methods

Control Tree Depth

You can control the depth of the trees using the MaxNumSplits, MinLeafSize, or MinParentSize name-value pair parameters. fitctree grows deep decision trees by default. You can grow shallower trees to reduce model complexity or computation time.

Load the ionosphere data set.

load ionosphere

The default values of the tree depth controllers for growing classification trees are:

n - 1 for MaxNumSplits. n is the training sample size.
1 for MinLeafSize.
10 for MinParentSize.

These default values tend to grow deep trees for large training sample sizes.

Train a classification tree using the default values for tree depth control. Cross-validate the model by using 10-fold cross-validation.

rng(1); % For reproducibility
MdlDefault = fitctree(X,Y,'CrossVal','on');

Draw a histogram of the number of imposed splits on the trees. Also, view one of the trees.

numBranches = @(x)sum(x.IsBranchNode); % Count branch nodes (splits) in a tree
mdlDefaultNumSplits = cellfun(numBranches, MdlDefault.Trained);

figure;
histogram(mdlDefaultNumSplits)

view(MdlDefault.Trained{1},'Mode','graph')

The average number of splits is around 15.

Suppose that you want a classification tree that is not as complex (deep) as the ones trained using the default number of splits. Train another classification tree, but set the maximum number of splits at 7, which is about half the mean number of splits from the default classification tree. Cross-validate the model by using 10-fold cross-validation.

Mdl7 = fitctree(X,Y,'MaxNumSplits',7,'CrossVal','on');
view(Mdl7.Trained{1},'Mode','graph')

Compare the cross-validation classification errors of the models.

classErrorDefault = kfoldLoss(MdlDefault)

classErrorDefault = 0.1168

classError7 = kfoldLoss(Mdl7)

classError7 = 0.1311

Mdl7 is much less complex and performs only slightly worse than MdlDefault.
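
The same kind of comparison can be sketched for another depth controller. The snippet below is an illustration only: the value 20 for MinLeafSize is an arbitrary assumption, and the variable names are hypothetical. It counts the splits in each cross-validated tree and computes the cross-validation error.

```matlab
load ionosphere
rng(1); % For reproducibility
MdlLeaf = fitctree(X,Y,'MinLeafSize',20,'CrossVal','on');

% Number of branch nodes (splits) in each of the 10 trained trees
numSplitsLeaf = cellfun(@(t)sum(t.IsBranchNode), MdlLeaf.Trained);
classErrorLeaf = kfoldLoss(MdlLeaf);
```

Requiring more observations per leaf typically yields fewer splits. Whether the error changes noticeably depends on the data.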

More About

Impurity and Node Error

ClassificationTree splits nodes based on either impurity or node error.

Impurity means one of several things, depending on your choice of the SplitCriterion name-value pair argument:

Gini's Diversity Index ('gdi') — The Gini index of a node is
$1 - \sum_{i} p^{2} (i),$
where the sum is over the classes i at the node, and p(i) is the observed fraction of observations with class i that reach the node. A node with just one class (a pure node) has Gini index 0; otherwise the Gini index is positive. So the Gini index is a measure of node impurity.
Deviance ('deviance') — With p(i) defined the same as for the Gini index, the deviance of a node is
$- \sum_{i} p (i) \log_{2} p (i) .$
A pure node has deviance 0; otherwise, the deviance is positive.
Twoing rule ('twoing') — Twoing is not a purity measure of a node, but is a different measure for deciding how to split a node. Let L(i) denote the fraction of members of class i in the left child node after a split, and R(i) denote the fraction of members of class i in the right child node after a split. Choose the split criterion to maximize
$P (L) P (R) {(\sum_{i} | L (i) - R (i) |)}^{2},$
where P(L) and P(R) are the fractions of observations that split to the left and right respectively. If the expression is large, the split made each child node purer. Conversely, if the expression is small, the split made the child nodes similar to each other, and therefore similar to the parent node, so the split did not increase node purity.
Node error — The node error is the fraction of misclassified observations at a node. If j is the class with the largest number of training samples at a node, the node error is
1 – p(j).
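
The formulas above can be checked numerically with a short sketch. The class fractions below are made-up values chosen for illustration, not output from a fitted tree, and all variable names are hypothetical.

```matlab
% Hypothetical class fractions p(i) at a node with three classes
p = [0.5 0.3 0.2];

gini = 1 - sum(p.^2);                        % Gini index: 1 - (0.25 + 0.09 + 0.04) = 0.62
pPos = p(p > 0);                             % drop empty classes so that 0*log2(0) is not evaluated
deviance = -sum(pPos .* log2(pPos));         % deviance: about 1.4855
nodeError = 1 - max(p);                      % node error: 1 - 0.5 = 0.5

% Hypothetical class fractions in the left and right child nodes after a split
L = [0.8 0.1 0.1];
R = [0.2 0.5 0.3];
PL = 0.5;                                    % fraction of observations sent left
PR = 0.5;                                    % fraction of observations sent right
twoing = PL * PR * sum(abs(L - R))^2;        % twoing: 0.25 * 1.2^2 = 0.36
```

For a pure node, for example `p = [1 0 0]`, the Gini index, deviance, and node error are all 0.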

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Usage notes and limitations:

The predict and update functions support code generation.
When you train a classification tree using fitctree, the following restrictions apply.
- The class labels input argument value (Y) cannot be a categorical array.
- Code generation does not support categorical predictors (logical, categorical, char, string, or cell). If you supply training data in a table, the predictors must be numeric (double or single). Also, you cannot use the 'CategoricalPredictors' name-value pair argument. To include categorical predictors in a model, preprocess the categorical predictors by using dummyvar before fitting the model.
- The value of the 'ClassNames' name-value pair argument cannot be a categorical array.
- The value of the 'ScoreTransform' name-value pair argument cannot be an anonymous function. For fixed-point code generation, the 'ScoreTransform' value cannot be 'invlogit'.
- You cannot use surrogate splits, that is, the value of the 'Surrogate' name-value pair argument must be 'off'.

For more information, see Introduction to Code Generation.
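
A typical workflow that respects these restrictions might look like the following sketch. It assumes a release that provides `saveLearnerForCoder` and `loadLearnerForCoder`, and that MATLAB Coder is installed for the final step. The file name `treeModel` and the entry-point function name `predictTree` are hypothetical.

```matlab
load ionosphere                              % numeric predictors, labels in a cell array (not categorical)
tree = fitctree(X,Y,'Surrogate','off');      % surrogate splits must be off for code generation
saveLearnerForCoder(tree,'treeModel');       % writes treeModel.mat for use by generated code

% Entry-point function, saved separately as predictTree.m (hypothetical name):
%
%   function label = predictTree(x) %#codegen
%       mdl = loadLearnerForCoder('treeModel');
%       label = predict(mdl,x);
%   end
%
% Generate code for one observation (requires MATLAB Coder):
%
%   codegen predictTree -args {X(1,:)}
```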