Predict labels using classification tree

label = predict(Mdl,X,Name,Value) uses additional options specified by one or more Name,Value pair arguments. For example, you can specify to prune Mdl to a particular level before predicting labels.

[label,score,node,cnum] = predict(___) uses any of the input arguments in the previous syntaxes and additionally returns:
A matrix of classification scores (score) indicating the likelihood that a label comes from a particular class. For classification trees, scores are posterior probabilities. For each observation in X, the predicted class label corresponds to the minimum expected misclassification cost among all classes.

A vector of predicted node numbers for the classification (node).

A vector of predicted class numbers for the classification (cnum).
Mdl — Trained classification tree
ClassificationTree model object | CompactClassificationTree model object

Trained classification tree, specified as a ClassificationTree or CompactClassificationTree model object. That is, Mdl is a trained classification model returned by fitctree or compact.
X — Predictor data to be classified

Predictor data to be classified, specified as a numeric matrix or table. Each row of X corresponds to one observation, and each column corresponds to one variable.

For a numeric matrix:

- The variables making up the columns of X must have the same order as the predictor variables that trained Mdl.
- If you trained Mdl using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. To treat numeric predictors in Tbl as categorical during training, identify categorical predictors using the CategoricalPredictors name-value pair argument of fitctree. If Tbl contains heterogeneous predictor variables (for example, numeric and categorical data types) and X is a numeric matrix, then predict throws an error.

For a table:

- predict does not support multi-column variables or cell arrays other than cell arrays of character vectors.
- If you trained Mdl using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those that trained Mdl (stored in Mdl.PredictorNames); see the sketch after this argument description. However, the column order of X does not need to correspond to the column order of Tbl. Tbl and X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
- If you trained Mdl using a numeric matrix, then the predictor names in Mdl.PredictorNames and the corresponding predictor variable names in X must be the same. To specify predictor names during training, see the PredictorNames name-value pair argument of fitctree. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
Data Types: table | double | single
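The following is a minimal sketch of table-based prediction using the built-in fisheriris data; the variable names SL, SW, PL, and PW are illustrative choices, not names used elsewhere in this documentation:

load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'}); % illustrative names
Tbl.Species = species;
Mdl = fitctree(Tbl,'Species');           % train on a table
Xnew = Tbl(1:5,{'SL','SW','PL','PW'});   % new data uses the same variable names
label = predict(Mdl,Xnew);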
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
'Subtrees' — Pruning level
vector of nonnegative integers in ascending order | 'all'

Pruning level, specified as the comma-separated pair consisting of 'Subtrees' and a vector of nonnegative integers in ascending order or 'all'.

If you specify a vector, then all elements must be at least 0 and at most max(Mdl.PruneList). 0 indicates the full, unpruned tree and max(Mdl.PruneList) indicates the completely pruned tree (i.e., just the root node).

If you specify 'all', then predict operates on all subtrees (i.e., the entire pruning sequence). This specification is equivalent to using 0:max(Mdl.PruneList).

predict prunes Mdl to each level indicated in Subtrees, and then estimates the corresponding output arguments. The size of Subtrees determines the size of some output arguments.

To invoke Subtrees, the properties PruneList and PruneAlpha of Mdl must be nonempty. In other words, grow Mdl by setting 'Prune','on', or by pruning Mdl using prune.

Example: 'Subtrees','all'

Data Types: single | double | char | string
label — Predicted class labels

Predicted class labels, returned as a vector or array. Each entry of label corresponds to the class with minimal expected cost for the corresponding row of X.

Suppose Subtrees is a numeric vector containing T elements (for 'all', see Subtrees), and X has N rows.

If the response data type is char and:

- T = 1, then label is a character matrix containing N rows. Each row contains the predicted label produced by subtree Subtrees.
- T > 1, then label is an N-by-T cell array.

Otherwise, label is an N-by-T array having the same data type as the response. (The software treats string arrays as cell arrays of character vectors.)

In the latter two cases, column j of label contains the vector of predicted labels produced by subtree Subtrees(j).
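For example, a minimal sketch of these sizes using the fisheriris data (the pruning levels 0 and 2 are illustrative and assume the grown tree has at least that many pruning levels):

load fisheriris
Mdl = fitctree(meas,species);                    % species is a cell array of character vectors
labelOne  = predict(Mdl,meas,'Subtrees',0);      % 150-by-1 cell array (T = 1)
labelMany = predict(Mdl,meas,'Subtrees',[0 2]);  % 150-by-2 cell array (T = 2)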
score — Posterior probabilities

Posterior probabilities, returned as a numeric matrix of size N-by-K, where N is the number of observations (rows) in X, and K is the number of classes (in Mdl.ClassNames). score(i,j) is the posterior probability that row i of X is of class j.

If Subtrees has T elements, and X has N rows, then score is an N-by-K-by-T array, and node and cnum are N-by-T matrices.
cnum — Class numbers

Class numbers corresponding to the predicted labels, returned as a numeric vector. Each entry of cnum corresponds to a predicted class number for the corresponding row of X.
Examine predictions for a few rows in a data set left out of training.
Load Fisher's iris data set.
load fisheriris
Partition the data into training (50%) and validation (50%) sets.
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false;                  % Validation set logical indices
Grow a classification tree using the training set.
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
Predict labels for the validation data. Count the number of misclassified observations.
label = predict(Mdl,meas(idxVal,:));
label(randsample(numel(label),5)) % Display several predicted labels
ans = 5x1 cell
{'setosa' }
{'setosa' }
{'setosa' }
{'virginica' }
{'versicolor'}
numMisclass = sum(~strcmp(label,species(idxVal)))
numMisclass = 3
The software misclassifies three out-of-sample observations.
Load Fisher's iris data set.
load fisheriris
Partition the data into training (50%) and validation (50%) sets.
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false;                  % Validation set logical indices
Grow a classification tree using the training set, and then view it.
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
view(Mdl,'Mode','graph')
The resulting tree has four levels.
Estimate posterior probabilities for the test set using subtrees pruned to levels 1 and 3.
[~,Posterior] = predict(Mdl,meas(idxVal,:),'SubTrees',[1 3]);
Mdl.ClassNames
ans = 3x1 cell
{'setosa' }
{'versicolor'}
{'virginica' }
Posterior(randsample(size(Posterior,1),5),:,:) % Display several posterior probabilities
ans =

ans(:,:,1) =

    1.0000         0         0
    1.0000         0         0
    1.0000         0         0
         0         0    1.0000
         0    0.8571    0.1429

ans(:,:,2) =

    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
The elements of Posterior are class posterior probabilities:
Rows correspond to observations in the validation set.
Columns correspond to the classes as listed in Mdl.ClassNames.
Pages correspond to the subtrees.
The subtree pruned to level 1 is more sure of its predictions than the subtree pruned to level 3 (i.e., the root node).
predict classifies by minimizing the expected classification cost:

$$\hat{y} = \underset{y=1,\dots,K}{\arg\min}\; \sum_{j=1}^{K} \hat{P}(j \mid x)\, C(y \mid j),$$

where:

- $\hat{y}$ is the predicted classification.
- K is the number of classes.
- $\hat{P}(j \mid x)$ is the posterior probability of class j for observation x.
- $C(y \mid j)$ is the cost of classifying an observation as y when its true class is j.
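A minimal sketch of this rule, assuming Phat is an N-by-K matrix of posterior probabilities and C is a K-by-K cost matrix in which C(j,y) is the cost of predicting class y when the true class is j:

% Sketch only: Phat and C are assumed inputs.
expectedCost = Phat*C;              % expectedCost(n,y) = sum_j Phat(n,j)*C(j,y)
[~,yhat] = min(expectedCost,[],2);  % index of the minimum-expected-cost class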
For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node.
For example, consider classifying a predictor X as true when X < 0.15 or X > 0.95, and X is false otherwise.
Generate 100 random points and classify them:
rng(0,'twister') % For reproducibility
X = rand(100,1);
Y = (abs(X - .55) > .4);
tree = fitctree(X,Y);
view(tree,'Mode','Graph')
Prune the tree:
tree1 = prune(tree,'Level',1);
view(tree1,'Mode','Graph')
The pruned tree correctly classifies observations that are less than 0.15 as true. It also correctly classifies observations from .15 to .94 as false. However, it incorrectly classifies observations that are greater than .94 as false. Therefore, the score for observations that are greater than .15 should be about .05/.85=.06 for true, and about .8/.85=.94 for false.
Compute the prediction scores for the first 10 rows of X:
[~,score] = predict(tree1,X(1:10));
[score X(1:10,:)]
ans = 10×3
0.9059 0.0941 0.8147
0.9059 0.0941 0.9058
0 1.0000 0.1270
0.9059 0.0941 0.9134
0.9059 0.0941 0.6324
0 1.0000 0.0975
0.9059 0.0941 0.2785
0.9059 0.0941 0.5469
0.9059 0.0941 0.9575
0.9059 0.0941 0.9649
Indeed, every value of X (the right-most column) that is less than 0.15 has associated scores (the left and center columns) of 0 and 1, while the other values of X have associated scores of 0.91 and 0.09. The difference (score 0.09 instead of the expected .06) is due to a statistical fluctuation: there are 8 observations in X in the range (.95,1) instead of the expected 5 observations.
There are two costs associated with classification: the true misclassification cost per class, and the expected misclassification cost per observation.

You can set the true misclassification cost per class in the Cost name-value pair when you create the classifier using the fitctree method. Cost(i,j) is the cost of classifying an observation into class j if its true class is i. By default, Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j. In other words, the cost is 0 for correct classification, and 1 for incorrect classification.
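For instance, a hedged sketch of an asymmetric cost matrix for the fisheriris problem; the specific cost values are illustrative only:

load fisheriris
% Rows are true classes, columns are predicted classes, in the sorted class
% order setosa, versicolor, virginica. Misclassifying a true 'virginica'
% observation is made five times as costly (illustrative values).
C = [0 1 1; 1 0 1; 5 5 0];
Mdl = fitctree(meas,species,'Cost',C);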
Suppose you have Nobs observations that you want to classify with a trained classifier, and suppose you have K classes. You place the observations into a matrix Xnew with one observation per row.

The expected cost matrix CE has size Nobs-by-K. Each row of CE contains the expected (average) cost of classifying the observation into each of the K classes. CE(n,k) is

$$CE(n,k) = \sum_{i=1}^{K} \hat{P}\bigl(i \mid Xnew(n)\bigr)\, C(k \mid i),$$

where:

- K is the number of classes.
- $\hat{P}(i \mid Xnew(n))$ is the posterior probability of class i for observation Xnew(n).
- $C(k \mid i)$ is the true misclassification cost of classifying an observation as k when its true class is i.
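A minimal sketch of this computation, assuming Mdl is a trained tree, Xnew is a matrix of new observations, and the stored cost matrix follows the convention that Mdl.Cost(i,k) is the cost of predicting class k for true class i:

% Sketch only: Mdl and Xnew are assumed inputs.
[~,Phat] = predict(Mdl,Xnew);  % Nobs-by-K posterior probabilities
CE = Phat*Mdl.Cost;            % CE(n,k) = sum_i Phat(n,i)*Mdl.Cost(i,k)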
The predictive measure of association is a value that indicates the similarity between decision rules that split observations. Among all possible decision splits that are compared to the optimal split (found by growing the tree), the best surrogate decision split yields the maximum predictive measure of association. The second-best surrogate split has the second-largest predictive measure of association.
Suppose xj and xk are predictor variables j and k, respectively, and j ≠ k. At node t, the predictive measure of association between the optimal split xj < u and a surrogate split xk < v is

$$\lambda_{jk} = \frac{\min(P_L, P_R) - \left(1 - P_{L_j L_k} - P_{R_j R_k}\right)}{\min(P_L, P_R)},$$

where:

- $P_L$ is the proportion of observations in node t, such that xj < u. The subscript L stands for the left child of node t.
- $P_R$ is the proportion of observations in node t, such that xj ≥ u. The subscript R stands for the right child of node t.
- $P_{L_j L_k}$ is the proportion of observations at node t, such that xj < u and xk < v.
- $P_{R_j R_k}$ is the proportion of observations at node t, such that xj ≥ u and xk ≥ v.
Observations with missing values for xj or xk do not contribute to the proportion calculations.
λjk is a value in (–∞,1]. If λjk > 0, then xk < v is a worthwhile surrogate split for xj < u.
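A minimal sketch of this calculation for one candidate surrogate split, assuming xj and xk are column vectors of the predictor values for the observations in node t, with split points u and v, and with NaN marking missing values:

% Sketch only: xj, xk, u, and v are assumed inputs.
ok  = ~isnan(xj) & ~isnan(xk);          % drop observations with missing values
PL  = mean(xj(ok) <  u);                % proportion going to the left child
PR  = mean(xj(ok) >= u);                % proportion going to the right child
PLL = mean(xj(ok) <  u & xk(ok) <  v);  % both splits send the observation left
PRR = mean(xj(ok) >= u & xk(ok) >= v);  % both splits send the observation right
lambda = (min(PL,PR) - (1 - PLL - PRR))/min(PL,PR);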
predict generates predictions by following the branches of Mdl until it reaches a leaf node or a missing value. If predict reaches a leaf node, it returns the classification of that node.

If predict reaches a node with a missing value for a predictor, its behavior depends on the setting of the Surrogate name-value pair when fitctree constructs Mdl (see the sketch after this list).

- Surrogate = 'off' (default) — predict returns the label with the largest number of training samples that reach the node.
- Surrogate = 'on' — predict uses the best surrogate split at the node. If all surrogate split variables with positive predictive measure of association are missing, predict returns the label with the largest number of training samples that reach the node. For a definition, see Predictive Measure of Association.
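For example, a minimal sketch of training with surrogate splits so that predict can handle missing predictor values; the NaN entry is injected here only for illustration:

load fisheriris
Mdl = fitctree(meas,species,'Surrogate','on');
Xmiss = meas(1:5,:);
Xmiss(2,3) = NaN;              % simulate a missing predictor value
label = predict(Mdl,Xmiss);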
This function fully supports tall arrays. You can use models trained on either in-memory or tall data with this function.
For more information, see Tall Arrays.
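A minimal sketch of tall-array prediction, assuming the in-memory fisheriris data for illustration (in practice, tall arrays typically come from a datastore):

load fisheriris
Mdl = fitctree(meas,species);  % model trained on in-memory data
tX = tall(meas);               % tall predictor data
tlabel = predict(Mdl,tX);      % deferred evaluation
label = gather(tlabel);        % evaluate and collect the labels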
Usage notes and limitations:

You can generate C/C++ code for both predict and update by using a coder configurer. Or, generate code only for predict by using saveLearnerForCoder, loadLearnerForCoder, and codegen.

- Code generation for predict and update — Create a coder configurer by using learnerCoderConfigurer and then generate code by using generateCode. Then you can update model parameters in the generated code without having to regenerate the code.
- Code generation for predict — Save a trained model by using saveLearnerForCoder. Define an entry-point function that loads the saved model by using loadLearnerForCoder and calls the predict function. Then use codegen (MATLAB Coder) to generate code for the entry-point function (see the sketch after this list).
- You can also generate single-precision C/C++ code for predict. For single-precision code generation, specify the name-value pair argument 'DataType','single' as an additional input to the loadLearnerForCoder function.
- You can also generate fixed-point C/C++ code for predict. Fixed-point code generation requires an additional step that defines the fixed-point data types of the variables required for prediction. Create a fixed-point data type structure by using the data type function generated by generateLearnerDataTypeFcn, and use the structure as an input argument of loadLearnerForCoder in an entry-point function. Generating fixed-point C/C++ code requires MATLAB® Coder™ and Fixed-Point Designer™.
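A minimal sketch of the saveLearnerForCoder workflow; the saved-model name 'treeMdl', the function name predictLabel, and the input specification are illustrative:

% Step 1 (at the command line): save the trained tree.
%   saveLearnerForCoder(Mdl,'treeMdl')

% Step 2: define an entry-point function in its own file, predictLabel.m.
function label = predictLabel(X) %#codegen
Mdl = loadLearnerForCoder('treeMdl');
label = predict(Mdl,X);
end

% Step 3 (at the command line): generate code for the entry point (requires MATLAB Coder).
%   codegen predictLabel -args {coder.typeof(0,[Inf 4],[1 0])}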
This table contains notes about the arguments of predict. Arguments not included in this table are fully supported.

| Argument | Notes and Limitations |
|---|---|
| Mdl | For the usage notes and limitations of the model object, see the code generation notes for the model object. |
| X | |
| label | If the response data type is char and codegen cannot determine that the value of Subtrees is a scalar, then label is a cell array of character vectors. |
| 'Subtrees' | |
For more information, see Introduction to Code Generation.
ClassificationTree | compact | CompactClassificationTree | edge | fitctree | loss | margin | prune