Predict labels using k-nearest neighbor classification model
label = predict(mdl,X) returns a vector of predicted class labels for the predictor data in the table or matrix X, based on the trained k-nearest neighbor classification model mdl. See Predicted Class Label.

[label,score,cost] = predict(mdl,X) also returns:
A matrix of classification scores (score) indicating the likelihood that a label comes from a particular class. For k-nearest neighbor, scores are posterior probabilities. See Posterior Probability.

A matrix of expected classification costs (cost). For each observation in X, the predicted class label corresponds to the minimum expected classification cost among all classes. See Expected Cost.
Create a k-nearest neighbor classifier for Fisher's iris data, where k = 5. Evaluate some model predictions on new data.
Load the Fisher iris data set.
load fisheriris
X = meas;
Y = species;
Create a classifier for five nearest neighbors. Standardize the noncategorical predictor data.
mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1);
Predict the classifications for flowers with minimum, mean, and maximum characteristics.
Xnew = [min(X);mean(X);max(X)];
[label,score,cost] = predict(mdl,Xnew)
label = 3x1 cell
    {'versicolor'}
    {'versicolor'}
    {'virginica' }

score = 3×3
    0.4000    0.6000         0
         0    1.0000         0
         0         0    1.0000

cost = 3×3
    0.6000    0.4000    1.0000
    1.0000         0    1.0000
    1.0000    1.0000         0
The second and third rows of the score and cost matrices have binary values, which means all five nearest neighbors of the mean and maximum flower measurements have identical classifications.
mdl — k-nearest neighbor classifier model
ClassificationKNN object

k-nearest neighbor classifier model, specified as a ClassificationKNN object.
X — Predictor data to be classified

Predictor data to be classified, specified as a numeric matrix or table. Each row of X corresponds to one observation, and each column corresponds to one variable.
For a numeric matrix:

- The variables that make up the columns of X must have the same order as the predictor variables used to train mdl.
- If you train mdl using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. k-nearest neighbor classification requires homogeneous predictors. Therefore, to treat all numeric predictors in Tbl as categorical during training, set 'CategoricalPredictors','all' when you train using fitcknn. If Tbl contains heterogeneous predictors (for example, numeric and categorical data types) and X is a numeric matrix, then predict throws an error.
For a table:

- predict does not support multicolumn variables and cell arrays other than cell arrays of character vectors.
- If you train mdl using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those used to train mdl (stored in mdl.PredictorNames). However, the column order of X does not need to correspond to the column order of Tbl. Both Tbl and X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
- If you train mdl using a numeric matrix, then the predictor names in mdl.PredictorNames and the corresponding predictor variable names in X must be the same. To specify predictor names during training, see the PredictorNames name-value pair argument of fitcknn. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.

A sketch at the end of this argument description illustrates both conventions.
If you set 'Standardize',true in fitcknn to train mdl, then the software standardizes the columns of X using the corresponding means in mdl.Mu and standard deviations in mdl.Sigma.

Data Types: double | single | table
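As an illustration of these rules, the following sketch (the table variable names SL, SW, PL, and PW are hypothetical) trains on a table and then predicts two ways: with a table whose variables match the training names but appear in a different order, and with a numeric matrix, which is allowed here because every predictor in the training table is numeric.

load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;
tblMdl = fitcknn(Tbl,'Species','NumNeighbors',5);

% Table input: same variable names; column order can differ from training.
labelTbl = predict(tblMdl,Tbl(1:3,{'PW','PL','SW','SL'}));

% Matrix input: allowed because all predictors in Tbl are numeric.
labelMat = predict(tblMdl,meas(1:3,:));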
label — Predicted class labels

Predicted class labels for the observations (rows) in X, returned as a categorical array, character array, logical vector, vector of numeric values, or cell array of character vectors. label has length equal to the number of rows in X. The label is the class with minimal expected cost. See Predicted Class Label.
score — Predicted class scores or posterior probabilities

Predicted class scores or posterior probabilities, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in X, and K is the number of classes (in mdl.ClassNames). score(i,j) is the posterior probability that observation i in X is of class j in mdl.ClassNames. See Posterior Probability.

Data Types: single | double
cost — Expected classification costs

Expected classification costs, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in X, and K is the number of classes (in mdl.ClassNames). cost(i,j) is the cost of classifying row i of X as class j in mdl.ClassNames. See Expected Cost.

Data Types: single | double
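Because the predicted label is the class with minimal expected cost, you can recover label from the cost output. A minimal check, assuming mdl and Xnew from the earlier example:

[label,~,cost] = predict(mdl,Xnew);
[~,idx] = min(cost,[],2);         % column of minimum expected cost per row
labelFromCost = mdl.ClassNames(idx);
isequal(label,labelFromCost)      % returns logical 1 (true)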
Predicted Class Label

predict classifies by minimizing the expected classification cost:

$$\hat{y} = \underset{y=1,\ldots,K}{\arg\min} \sum_{j=1}^{K} \hat{P}(j \mid x)\,C(y \mid j),$$

where:

- $\hat{y}$ is the predicted classification.
- $K$ is the number of classes.
- $\hat{P}(j \mid x)$ is the posterior probability of class $j$ for observation $x$.
- $C(y \mid j)$ is the cost of classifying an observation as $y$ when its true class is $j$.
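As a sketch of this rule, you can reproduce the predicted labels from the posterior probabilities and the true misclassification cost matrix stored in the model (assuming mdl and Xnew from the earlier example):

[~,score] = predict(mdl,Xnew);
expCost = score*mdl.Cost;         % expCost(n,y) = sum_j P(j|x_n)*C(y|j)
[~,yhat] = min(expCost,[],2);     % minimize over candidate classes y
predictedLabel = mdl.ClassNames(yhat)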
Posterior Probability

Consider a vector (single query point) xnew and a model mdl.

- k is the number of nearest neighbors used in prediction, mdl.NumNeighbors.
- nbd(mdl,xnew) specifies the k nearest neighbors to xnew in mdl.X.
- Y(nbd) specifies the classifications of the points in nbd(mdl,xnew), namely mdl.Y(nbd).
- W(nbd) specifies the weights of the points in nbd(mdl,xnew).
- prior specifies the priors of the classes in mdl.Y.

If the model contains a vector of prior probabilities, then the observation weights W are normalized by class to sum to the priors. This process might involve a calculation for the point xnew, because weights can depend on the distance from xnew to the points in mdl.X.

The posterior probability p(j|xnew) is

$$p(j \mid x_{\mathrm{new}}) = \frac{\sum_{i \in \mathrm{nbd}(mdl,\,x_{\mathrm{new}})} W(i)\,\mathbf{1}_{\{Y(X(i)) = j\}}}{\sum_{i \in \mathrm{nbd}(mdl,\,x_{\mathrm{new}})} W(i)}.$$

Here, $\mathbf{1}_{\{Y(X(i)) = j\}}$ is 1 when mdl.Y(i) = j, and 0 otherwise.
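The following sketch reproduces this computation for the model from the earlier example, assuming the default uniform observation weights and empirical priors. Because that model was trained with 'Standardize',1, the sketch standardizes both the training data and the query point with mdl.Mu and mdl.Sigma before the neighbor search, as predict does internally.

xnew = mean(X);                                % query point (row vector)
Xs = (mdl.X - mdl.Mu)./mdl.Sigma;              % standardized training data
Xq = (xnew - mdl.Mu)./mdl.Sigma;               % standardized query
nbd = knnsearch(Xs,Xq,'K',mdl.NumNeighbors);   % indices of the k neighbors
neighborY = mdl.Y(nbd);                        % their class labels
post = zeros(1,numel(mdl.ClassNames));
for j = 1:numel(mdl.ClassNames)
    post(j) = mean(strcmp(neighborY,mdl.ClassNames{j}));
end
post                                           % matches the score row from predict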
True Misclassification Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation.

You can set the true misclassification cost per class by using the 'Cost' name-value pair argument when you run fitcknn. The value Cost(i,j) is the cost of classifying an observation into class j if its true class is i. By default, Cost(i,j) = 1 if i ~= j, and Cost(i,j) = 0 if i = j. In other words, the cost is 0 for correct classification and 1 for incorrect classification.
Expected Cost

The third output of predict is the expected misclassification cost per observation.
Suppose you have Nobs observations that you want to classify with a trained classifier mdl, and you have K classes. You place the observations into a matrix Xnew with one observation per row. The command

[label,score,cost] = predict(mdl,Xnew)

returns a matrix cost of size Nobs-by-K, among other outputs. Each row of the cost matrix contains the expected (average) cost of classifying the observation into each of the K classes. cost(n,j) is

$$\mathrm{cost}(n,j) = \sum_{i=1}^{K} \hat{P}(i \mid Xnew(n))\,C(j \mid i),$$

where:

- $K$ is the number of classes.
- $\hat{P}(i \mid Xnew(n))$ is the posterior probability of class $i$ for observation $Xnew(n)$.
- $C(j \mid i)$ is the true misclassification cost of classifying an observation as $j$ when its true class is $i$.
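In matrix form, cost is the score matrix times the true misclassification cost matrix, which you can verify against the third output of predict (assuming mdl and Xnew from the earlier example):

[~,score,cost] = predict(mdl,Xnew);
max(abs(cost - score*mdl.Cost),[],'all')   % approximately 0, up to round-off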
This function fully supports tall arrays. You can use models trained on either in-memory or tall data with this function.
For more information, see Tall Arrays (MATLAB).
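As a minimal sketch, you can pass a tall version of the predictor data to predict; here the tall array is created from in-memory data purely for illustration (tall data would typically come from a datastore):

tX = tall(meas);             % tall predictor data
tLabel = predict(mdl,tX);    % deferred tall computation
label = gather(tLabel);      % evaluates and collects the labels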
Usage notes and limitations:

Use saveLearnerForCoder, loadLearnerForCoder, and codegen to generate code for the predict function. Save a trained model by using saveLearnerForCoder. Define an entry-point function that loads the saved model by using loadLearnerForCoder and calls the predict function. Then use codegen to generate code for the entry-point function.
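A sketch of this workflow follows. The saved file name knnModel and entry-point name predictKNN are illustrative choices, and Xnew is example data of the same type and size as the inputs you expect at run time.

Save the trained model:

saveLearnerForCoder(mdl,'knnModel');

Define an entry-point function in its own file, predictKNN.m, that loads the model and predicts:

function label = predictKNN(X) %#codegen
mdl = loadLearnerForCoder('knnModel');
label = predict(mdl,X);
end

Generate code for the entry-point function:

codegen predictKNN -args {Xnew}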
This table contains notes about the arguments of predict. Arguments not included in this table are fully supported.
Argument | Notes and Limitations
---------|----------------------
mdl      |
X        |
For more information, see Introduction to Code Generation.