Predict labels using classification tree

label = predict(Mdl,X,Name,Value) uses additional options specified by one or more Name,Value pair arguments. For example, you can specify to prune Mdl to a particular level before predicting labels.

[label,score,node,cnum] = predict(___) uses any of the input arguments in the previous syntaxes and additionally returns:
A matrix of classification scores (score) indicating the likelihood that a label comes from a particular class. For classification trees, scores are posterior probabilities. For each observation in X, the predicted class label corresponds to the minimum expected misclassification cost among all classes.

A vector of predicted node numbers for the classification (node).

A vector of predicted class numbers for the classification (cnum).
Mdl — Trained classification tree
ClassificationTree model object | CompactClassificationTree model object

Trained classification tree, specified as a ClassificationTree or CompactClassificationTree model object. That is, Mdl is a trained classification model returned by fitctree or compact.
X — Predictor data to be classified

Predictor data to be classified, specified as a numeric matrix or table. Each row of X corresponds to one observation, and each column corresponds to one variable.

For a numeric matrix:

- The variables making up the columns of X must have the same order as the predictor variables that trained Mdl.
- If you trained Mdl using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. To treat numeric predictors in Tbl as categorical during training, identify categorical predictors using the CategoricalPredictors name-value pair argument of fitctree. If Tbl contains heterogeneous predictor variables (for example, numeric and categorical data types) and X is a numeric matrix, then predict throws an error.

For a table:

- predict does not support multi-column variables or cell arrays other than cell arrays of character vectors.
- If you trained Mdl using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those that trained Mdl (stored in Mdl.PredictorNames); see the sketch after this argument description. However, the column order of X does not need to correspond to the column order of Tbl. Tbl and X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
- If you trained Mdl using a numeric matrix, then the predictor names in Mdl.PredictorNames and the corresponding predictor variable names in X must be the same. To specify predictor names during training, see the PredictorNames name-value pair argument of fitctree. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
Data Types: table | double | single
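The following is a minimal sketch of table-based prediction using the built-in fisheriris data; the variable names SL, SW, PL, and PW are illustrative choices, not names used elsewhere in this documentation:

load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'}); % illustrative names
Tbl.Species = species;
Mdl = fitctree(Tbl,'Species');           % train on a table
Xnew = Tbl(1:5,{'SL','SW','PL','PW'});   % new data uses the same variable names
label = predict(Mdl,Xnew);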
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
'Subtrees' — Pruning level
vector of nonnegative integers in ascending order | 'all'

Pruning level, specified as the comma-separated pair consisting of 'Subtrees' and a vector of nonnegative integers in ascending order or 'all'.

If you specify a vector, then all elements must be at least 0 and at most max(Mdl.PruneList). 0 indicates the full, unpruned tree and max(Mdl.PruneList) indicates the completely pruned tree (i.e., just the root node).

If you specify 'all', then predict operates on all subtrees (i.e., the entire pruning sequence). This specification is equivalent to using 0:max(Mdl.PruneList).

predict prunes Mdl to each level indicated in Subtrees, and then estimates the corresponding output arguments. The size of Subtrees determines the size of some output arguments.

To invoke Subtrees, the properties PruneList and PruneAlpha of Mdl must be nonempty. In other words, grow Mdl by setting 'Prune','on', or by pruning Mdl using prune.

Example: 'Subtrees','all'

Data Types: single | double | char | string
label — Predicted class labels

Predicted class labels, returned as a vector or array. Each entry of label corresponds to the class with minimal expected cost for the corresponding row of X.

Suppose Subtrees is a numeric vector containing T elements (for 'all', see Subtrees), and X has N rows.

If the response data type is char and:

- T = 1, then label is a character matrix containing N rows. Each row contains the predicted label produced by subtree Subtrees.
- T > 1, then label is an N-by-T cell array.

Otherwise, label is an N-by-T array having the same data type as the response. (The software treats string arrays as cell arrays of character vectors.)

In the latter two cases, column j of label contains the vector of predicted labels produced by subtree Subtrees(j).
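For example, a minimal sketch of these sizes using the fisheriris data (the pruning levels 0 and 2 are illustrative and assume the grown tree has at least that many pruning levels):

load fisheriris
Mdl = fitctree(meas,species);                    % species is a cell array of character vectors
labelOne  = predict(Mdl,meas,'Subtrees',0);      % 150-by-1 cell array (T = 1)
labelMany = predict(Mdl,meas,'Subtrees',[0 2]);  % 150-by-2 cell array (T = 2)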
score — Posterior probabilities

Posterior probabilities, returned as a numeric matrix of size N-by-K, where N is the number of observations (rows) in X, and K is the number of classes (in Mdl.ClassNames). score(i,j) is the posterior probability that row i of X is of class j.

If Subtrees has T elements, and X has N rows, then score is an N-by-K-by-T array, and node and cnum are N-by-T matrices.
cnum — Class numbers

Class numbers corresponding to the predicted labels, returned as a numeric vector. Each entry of cnum corresponds to a predicted class number for the corresponding row of X.
Examine predictions for a few rows in a data set left out of training.
Load Fisher's iris data set.
load fisheriris
Partition the data into training (50%) and validation (50%) sets.
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false;                  % Validation set logical indices
Grow a classification tree using the training set.
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
Predict labels for the validation data. Count the number of misclassified observations.
label = predict(Mdl,meas(idxVal,:));
label(randsample(numel(label),5)) % Display several predicted labels
ans = 5x1 cell
{'setosa' }
{'setosa' }
{'setosa' }
{'virginica' }
{'versicolor'}
numMisclass = sum(~strcmp(label,species(idxVal)))
numMisclass = 3
The software misclassifies three out-of-sample observations.
Load Fisher's iris data set.
load fisheriris
Partition the data into training (50%) and validation (50%) sets.
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false;                  % Validation set logical indices
Grow a classification tree using the training set, and then view it.
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
view(Mdl,'Mode','graph')
The resulting tree has four levels.
Estimate posterior probabilities for the test set using subtrees pruned to levels 1 and 3.
[~,Posterior] = predict(Mdl,meas(idxVal,:),'SubTrees',[1 3]);
Mdl.ClassNames
ans = 3x1 cell
{'setosa' }
{'versicolor'}
{'virginica' }
Posterior(randsample(size(Posterior,1),5),:,:) % Display several posterior probabilities
ans =

ans(:,:,1) =

    1.0000         0         0
    1.0000         0         0
    1.0000         0         0
         0         0    1.0000
         0    0.8571    0.1429

ans(:,:,2) =

    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
The elements of Posterior are class posterior probabilities:
Rows correspond to observations in the validation set.
Columns correspond to the classes as listed in Mdl.ClassNames.
Pages correspond to the subtrees.
The subtree pruned to level 1 is more sure of its predictions than the subtree pruned to level 3 (i.e., the root node).
predict classifies by minimizing the expected classification cost:

$$\hat{y} = \underset{y=1,\dots,K}{\arg\min}\; \sum_{j=1}^{K} \hat{P}(j \mid x)\, C(y \mid j),$$

where:

- $\hat{y}$ is the predicted classification.
- K is the number of classes.
- $\hat{P}(j \mid x)$ is the posterior probability of class j for observation x.
- $C(y \mid j)$ is the cost of classifying an observation as y when its true class is j.
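A minimal sketch of this rule, assuming Phat is an N-by-K matrix of posterior probabilities and C is a K-by-K cost matrix in which C(j,y) is the cost of predicting class y when the true class is j:

% Sketch only: Phat and C are assumed inputs.
expectedCost = Phat*C;              % expectedCost(n,y) = sum_j Phat(n,j)*C(j,y)
[~,yhat] = min(expectedCost,[],2);  % index of the minimum-expected-cost class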
For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node.
For example, consider classifying a predictor X as true when X < 0.15 or X > 0.95, and X is false otherwise.
Generate 100 random points and classify them:
rng(0,'twister') % For reproducibility
X = rand(100,1);
Y = (abs(X - .55) > .4);
tree = fitctree(X,Y);
view(tree,'Mode','Graph')
Prune the tree:
tree1 = prune(tree,'Level',1);
view(tree1,'Mode','Graph')
The pruned tree correctly classifies observations that are less than 0.15 as true. It also correctly classifies observations from .15 to .94 as false. However, it incorrectly classifies observations that are greater than .94 as false. Therefore, the score for observations that are greater than .15 should be about .05/.85=.06 for true, and about .8/.85=.94 for false.
Compute the prediction scores for the first 10 rows of X:
[~,score] = predict(tree1,X(1:10));
[score X(1:10,:)]
ans = 10×3
0.9059 0.0941 0.8147
0.9059 0.0941 0.9058
0 1.0000 0.1270
0.9059 0.0941 0.9134
0.9059 0.0941 0.6324
0 1.0000 0.0975
0.9059 0.0941 0.2785
0.9059 0.0941 0.5469
0.9059 0.0941 0.9575
0.9059 0.0941 0.9649
Indeed, every value of X (the right-most column) that is less than 0.15 has associated scores (the left and center columns) of 0 and 1, while the other values of X have associated scores of 0.91 and 0.09. The difference (score 0.09 instead of the expected .06) is due to a statistical fluctuation: there are 8 observations in X in the range (.95,1) instead of the expected 5 observations.
There are two costs associated with classification: the true misclassification cost per class, and the expected misclassification cost per observation.

You can set the true misclassification cost per class in the Cost name-value pair when you create the classifier using the fitctree method. Cost(i,j) is the cost of classifying an observation into class j if its true class is i. By default, Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j. In other words, the cost is 0 for correct classification, and 1 for incorrect classification.
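For instance, a hedged sketch of an asymmetric cost matrix for the fisheriris problem; the specific cost values are illustrative only:

load fisheriris
% Rows are true classes, columns are predicted classes, in the sorted class
% order setosa, versicolor, virginica. Misclassifying a true 'virginica'
% observation is made five times as costly (illustrative values).
C = [0 1 1; 1 0 1; 5 5 0];
Mdl = fitctree(meas,species,'Cost',C);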
Suppose you have Nobs observations that you want to classify with a trained classifier, and suppose you have K classes. You place the observations into a matrix Xnew with one observation per row.

The expected cost matrix CE has size Nobs-by-K. Each row of CE contains the expected (average) cost of classifying the observation into each of the K classes. CE(n,k) is

$$CE(n,k) = \sum_{i=1}^{K} \hat{P}\bigl(i \mid Xnew(n)\bigr)\, C(k \mid i),$$

where:

- K is the number of classes.
- $\hat{P}(i \mid Xnew(n))$ is the posterior probability of class i for observation Xnew(n).
- $C(k \mid i)$ is the true misclassification cost of classifying an observation as k when its true class is i.
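A minimal sketch of this computation, assuming Mdl is a trained tree, Xnew is a matrix of new observations, and the stored cost matrix follows the convention that Mdl.Cost(i,k) is the cost of predicting class k for true class i:

% Sketch only: Mdl and Xnew are assumed inputs.
[~,Phat] = predict(Mdl,Xnew);  % Nobs-by-K posterior probabilities
CE = Phat*Mdl.Cost;            % CE(n,k) = sum_i Phat(n,i)*Mdl.Cost(i,k)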
The predictive measure of association is a value that indicates the similarity between decision rules that split observations. Among all possible decision splits that are compared to the optimal split (found by growing the tree), the best surrogate decision split yields the maximum predictive measure of association. The second-best surrogate split has the second-largest predictive measure of association.
Suppose xj and xk are predictor variables j and k, respectively, and j ≠ k. At node t, the predictive measure of association between the optimal split xj < u and a surrogate split xk < v is

$$\lambda_{jk} = \frac{\min(P_L, P_R) - \left(1 - P_{L_j L_k} - P_{R_j R_k}\right)}{\min(P_L, P_R)},$$

where:

- $P_L$ is the proportion of observations in node t, such that xj < u. The subscript L stands for the left child of node t.
- $P_R$ is the proportion of observations in node t, such that xj ≥ u. The subscript R stands for the right child of node t.
- $P_{L_j L_k}$ is the proportion of observations at node t, such that xj < u and xk < v.
- $P_{R_j R_k}$ is the proportion of observations at node t, such that xj ≥ u and xk ≥ v.
Observations with missing values for xj or xk do not contribute to the proportion calculations.
λjk is a value in (–∞,1]. If λjk > 0, then xk < v is a worthwhile surrogate split for xj < u.
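A minimal sketch of this calculation for one candidate surrogate split, assuming xj and xk are column vectors of the predictor values for the observations in node t, with split points u and v, and with NaN marking missing values:

% Sketch only: xj, xk, u, and v are assumed inputs.
ok  = ~isnan(xj) & ~isnan(xk);          % drop observations with missing values
PL  = mean(xj(ok) <  u);                % proportion going to the left child
PR  = mean(xj(ok) >= u);                % proportion going to the right child
PLL = mean(xj(ok) <  u & xk(ok) <  v);  % both splits send the observation left
PRR = mean(xj(ok) >= u & xk(ok) >= v);  % both splits send the observation right
lambda = (min(PL,PR) - (1 - PLL - PRR))/min(PL,PR);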
predict generates predictions by following the branches of Mdl until it reaches a leaf node or a missing value. If predict reaches a leaf node, it returns the classification of that node.

If predict reaches a node with a missing value for a predictor, its behavior depends on the setting of the Surrogate name-value pair when fitctree constructs Mdl (see the sketch after this list).

- Surrogate = 'off' (default) — predict returns the label with the largest number of training samples that reach the node.
- Surrogate = 'on' — predict uses the best surrogate split at the node. If all surrogate split variables with positive predictive measure of association are missing, predict returns the label with the largest number of training samples that reach the node. For a definition, see Predictive Measure of Association.
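For example, a minimal sketch of training with surrogate splits so that predict can handle missing predictor values; the NaN entry is injected here only for illustration:

load fisheriris
Mdl = fitctree(meas,species,'Surrogate','on');
Xmiss = meas(1:5,:);
Xmiss(2,3) = NaN;              % simulate a missing predictor value
label = predict(Mdl,Xmiss);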
This function fully supports tall arrays. You can use models trained on either in-memory or tall data with this function.
For more information, see Tall Arrays.
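A minimal sketch of tall-array prediction, assuming the in-memory fisheriris data for illustration (in practice, tall arrays typically come from a datastore):

load fisheriris
Mdl = fitctree(meas,species);  % model trained on in-memory data
tX = tall(meas);               % tall predictor data
tlabel = predict(Mdl,tX);      % deferred evaluation
label = gather(tlabel);        % evaluate and collect the labels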
Usage notes and limitations:

You can generate C/C++ code for both predict and update by using a coder configurer. Or, generate code only for predict by using saveLearnerForCoder, loadLearnerForCoder, and codegen.

- Code generation for predict and update — Create a coder configurer by using learnerCoderConfigurer and then generate code by using generateCode. Then you can update model parameters in the generated code without having to regenerate the code.
- Code generation for predict — Save a trained model by using saveLearnerForCoder. Define an entry-point function that loads the saved model by using loadLearnerForCoder and calls the predict function. Then use codegen (MATLAB Coder) to generate code for the entry-point function (see the sketch after this list).
- You can also generate single-precision C/C++ code for predict. For single-precision code generation, specify the name-value pair argument 'DataType','single' as an additional input to the loadLearnerForCoder function.
- You can also generate fixed-point C/C++ code for predict. Fixed-point code generation requires an additional step that defines the fixed-point data types of the variables required for prediction. Create a fixed-point data type structure by using the data type function generated by generateLearnerDataTypeFcn, and use the structure as an input argument of loadLearnerForCoder in an entry-point function. Generating fixed-point C/C++ code requires MATLAB® Coder™ and Fixed-Point Designer™.
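A minimal sketch of the saveLearnerForCoder workflow; the saved-model name 'treeMdl', the function name predictLabel, and the input specification are illustrative:

% Step 1 (at the command line): save the trained tree.
%   saveLearnerForCoder(Mdl,'treeMdl')

% Step 2: define an entry-point function in its own file, predictLabel.m.
function label = predictLabel(X) %#codegen
Mdl = loadLearnerForCoder('treeMdl');
label = predict(Mdl,X);
end

% Step 3 (at the command line): generate code for the entry point (requires MATLAB Coder).
%   codegen predictLabel -args {coder.typeof(0,[Inf 4],[1 0])}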
This table contains notes about the arguments of predict. Arguments not included in this table are fully supported.

| Argument | Notes and Limitations |
|---|---|
| Mdl | For the usage notes and limitations of the model object, see the code generation notes for the model object. |
| X | |
| label | If the response data type is char and codegen cannot determine that the value of Subtrees is a scalar, then label is a cell array of character vectors. |
| 'Subtrees' | |
For more information, see Introduction to Code Generation.
ClassificationTree | compact | CompactClassificationTree | edge | fitctree | loss | margin | prune