Compare accuracies of two classification models by repeated cross-validation
testckfold
statistically assesses the accuracies of two
classification models by repeatedly cross-validating the two models, determining the
differences in the classification loss, and then formulating the test statistic by
combining the classification loss differences. This type of test is particularly
appropriate when sample size is limited.
You can assess whether the accuracies of the classification models are different, or
whether one classification model performs better than another. Available tests include a
5-by-2 paired t test, a 5-by-2 paired F test, and
a 10-by-10 repeated cross-validation t test. For more details, see
Repeated Cross-Validation Tests. To speed up computations,
testckfold
supports parallel computing (requires a Parallel Computing Toolbox™ license).
h = testckfold(C1,C2,X1,X2) returns the test decision that results from conducting a 5-by-2 paired F cross-validation test. The null hypothesis is that the classification models C1 and C2 have equal accuracy in predicting the true class labels using the predictor and response data in the tables X1 and X2. h = 1 indicates rejection of the null hypothesis at the 5% significance level.
testckfold conducts the cross-validation test by applying C1 and C2 to all predictor variables in X1 and X2, respectively. The true class labels in X1 and X2 must be the same. The response variable names in X1, X2, C1.ResponseName, and C2.ResponseName must be the same.
For examples of ways to compare models, see Tips.
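For instance, the following is a minimal sketch of this syntax that compares a classification tree with a discriminant analysis model. The Fisher iris data and the table and variable names are illustrative; because both full models name the response, Y is omitted.

% Minimal sketch: 5-by-2 paired F test comparing two full classification models
% on the same table. The data set and variable names are illustrative.
load fisheriris                                % built-in example data: meas, species
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;                         % common response variable

C1 = fitctree(Tbl,'Species');                  % first model: classification tree
C2 = fitcdiscr(Tbl,'Species');                 % second model: discriminant analysis

rng(1)                                         % reproducible cross-validation partitions
h = testckfold(C1,C2,Tbl,Tbl)                  % h = 1 rejects equal accuracy at the 5% level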
h = testckfold(___,Name,Value) uses any of the input arguments in the previous syntaxes and additional options specified by one or more Name,Value pair arguments. For example, you can specify the type of alternative hypothesis, the type of test, or the use of parallel computing.
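For instance, a sketch with illustrative settings (reusing C1, C2, and Tbl from the sketch above) that requests the 10-by-10 repeated cross-validation t test, a right-tailed alternative, and parallel computation, and also returns the p-value p and the classification loss matrices e1 and e2:

% Sketch: 10-by-10 repeated cross-validation t test, right-tailed alternative
% (is C1 more accurate than C2?), computed in parallel. The 'Options' value is a
% statset structure; UseParallel requires Parallel Computing Toolbox.
[h,p,e1,e2] = testckfold(C1,C2,Tbl,Tbl, ...
    'Test','10x10t','Alternative','greater', ...
    'Options',statset('UseParallel',true))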
Examples of ways to compare models include:
Compare the accuracies of a simple classification model and a more complex model by passing the same set of predictor data (see the sketch after this list).
Compare the accuracies of two different models using two different sets of predictors.
Perform various types of Feature Selection. For
example, you can compare the accuracy of a model trained using a
set of predictors to the accuracy of one trained on a subset or
different set of predictors. You can arbitrarily choose the set
of predictors, or use a feature selection technique like PCA or
sequential feature selection (see pca
and sequentialfs
).
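As a sketch of the first comparison in this list, the following compares a single classification tree with a boosted ensemble of trees; the ionosphere data set and the ensemble settings are illustrative.

% Sketch: simple model (one tree) versus a more complex model (boosted ensemble
% of 100 trees) on the same predictors. Data set and settings are illustrative.
load ionosphere                                          % built-in binary data: X, Y
C1 = templateTree();                                     % simple model
C2 = templateEnsemble('AdaBoostM1',100,templateTree());  % more complex model
rng(1)
h = testckfold(C1,C2,X,X,Y)                              % default 5-by-2 paired F test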
If X1 and X2 are tables that contain the response variable and use the same response variable name, and C1 and C2 are full classification models whose ResponseName properties specify that response variable, then you can omit supplying Y. Consequently, testckfold uses the common response variable in the tables.
One way to perform cost-insensitive feature selection is (a sketch of these steps follows the list):
1. Create a classification model template that characterizes the first classification model (C1).
2. Create a classification model template that characterizes the second classification model (C2).
3. Specify two predictor data sets. For example, specify X1 as the full predictor set and X2 as a reduced set.
4. Enter testckfold(C1,C2,X1,X2,Y,'Alternative','less').
If testckfold returns 1, then there is enough evidence to suggest that the classification model that uses fewer predictors performs better than the model that uses the full predictor set.
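A minimal sketch of steps 1–4, assuming the illustrative table Tbl from the first sketch above:

% Sketch of steps 1-4; the predictor and response names are illustrative.
C1 = templateTree();                                % step 1: first model template
C2 = templateTree();                                % step 2: second model template

X1 = Tbl(:,{'SL','SW','PL','PW'});                  % step 3: full predictor set
X2 = Tbl(:,{'PL','PW'});                            %         reduced predictor set
Y  = Tbl.Species;

rng(1)
h = testckfold(C1,C2,X1,X2,Y,'Alternative','less')  % step 4
% h = 1 suggests the model trained on the reduced set X2 is more accurate.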
Alternatively, you can assess whether there is a significant difference between the accuracies of the two models. To perform this assessment, remove the 'Alternative','less' specification in step 4. Then testckfold conducts a two-sided test, and h = 0 indicates that there is not enough evidence to suggest a difference in the accuracy of the two models.
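Continuing the sketch of steps 1–4, the two-sided variant is:

% Two-sided variant of step 4 (default 'Alternative','unequal'), reusing C1, C2,
% X1, X2, and Y from the sketch above. p is the p-value of the test.
[h,p] = testckfold(C1,C2,X1,X2,Y)        % h = 0: no evidence the accuracies differ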
The tests are appropriate for the misclassification
rate classification
loss, but you can specify other loss functions (see LossFun
).
The key assumptions are that the estimated classification losses are
independent and normally distributed with mean 0 and finite common
variance under the two-sided null hypothesis. Classification losses
other than the misclassification rate can violate this assumption.
Highly discrete data, imbalanced classes, and highly imbalanced cost matrices can violate the normality assumption of classification loss differences.
If you specify to conduct the 10-by-10 repeated cross-validation t test
using 'Test','10x10t'
, then testckfold
uses
10 degrees of freedom for the t distribution to
find the critical region and estimate the p-value.
For more details, see [2] and [3].
Use testcholdout
:
For test sets with larger sample sizes
To implement variants of the McNemar test to compare two classification model accuracies
For cost-sensitive testing using a chi-square or likelihood
ratio test. The chi-square test uses quadprog
(Optimization Toolbox),
which requires an Optimization Toolbox™ license.
[1] Alpaydin, E. “Combined 5 x 2 CV F Test for Comparing Supervised Classification Learning Algorithms.” Neural Computation, Vol. 11, No. 8, 1999, pp. 1885–1892.
[2] Bouckaert, R. “Choosing Between Two Learning Algorithms Based on Calibrated Tests.” International Conference on Machine Learning, 2003, pp. 51–58.
[3] Bouckaert, R., and E. Frank. “Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms.” Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, 2004, pp. 3–12.
[4] Dietterich, T. “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.” Neural Computation, Vol. 10, No. 7, 1998, pp. 1895–1923.
[5] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd Ed. New York: Springer, 2008.
templateDiscriminant | templateECOC | templateEnsemble | templateKNN | templateNaiveBayes | templateSVM | templateTree | testcholdout