crossval

Class: RegressionGP

Cross-validate Gaussian process regression model

Syntax

cvMdl = crossval(gprMdl)
cvMdl = crossval(gprMdl,Name,Value)

Description

cvMdl = crossval(gprMdl) returns the partitioned model, cvMdl, built from the Gaussian process regression (GPR) model, gprMdl, using 10-fold cross-validation.

cvMdl is a RegressionPartitionedModel object, and gprMdl is a RegressionGP (full) object.

cvMdl = crossval(gprMdl,Name,Value) returns the partitioned model, cvMdl, with additional options specified by one or more Name,Value pair arguments. For example, you can specify the number of folds or the fraction of the data to use for testing.
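
As a minimal sketch of the two syntaxes, the following code fits a GPR model to synthetic data and cross-validates it once with the default settings and once with an explicit number of folds. The data and the variable names X, y, cvDefault, and cvFive are assumptions made for illustration only.

```matlab
rng(0)                                  % for reproducibility
X = linspace(0,10,100)';                % hypothetical 1-D predictor
y = sin(X) + 0.1*randn(100,1);          % hypothetical noisy response
gprMdl = fitrgp(X,y);                   % full RegressionGP model

cvDefault = crossval(gprMdl);           % first syntax: 10 folds by default
cvFive = crossval(gprMdl,'KFold',5);    % second syntax: request 5 folds

numDefault = cvDefault.KFold;           % number of folds actually used
numFive = cvFive.KFold;
```

The KFold property of the returned object reports how many folds were used, which is a quick way to confirm that an option took effect.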

Input Arguments

Gaussian process regression model, gprMdl, specified as a RegressionGP (full) object. You cannot call crossval on a compact regression object.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Random partition for a k-fold cross-validation, specified as the comma-separated pair consisting of 'CVPartition' and a cvpartition object.

Example: 'CVPartition',cvp uses the random partition defined by cvp.

If you specify CVPartition, then you cannot specify Holdout, KFold, or Leaveout.
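
A minimal sketch of supplying your own partition, assuming synthetic data: the partition here is built on the number of observations rather than on a grouping variable. The variable names n, X, y, cvp, and foldSizes are hypothetical.

```matlab
rng(0)                                  % for reproducibility
n = 60;
X = linspace(0,6,n)';                   % hypothetical predictor
y = cos(X) + 0.1*randn(n,1);            % hypothetical response
gprMdl = fitrgp(X,y);

cvp = cvpartition(n,'KFold',3);         % random 3-fold partition of n observations
cvMdl = crossval(gprMdl,'CVPartition',cvp);

foldSizes = cvMdl.Partition.TestSize;   % number of test observations in each fold
```

Passing the same cvpartition object to several models lets you compare them on identical folds.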

Fraction of the data to use for testing in holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range from 0 to 1. If you specify 'Holdout',p, then crossval:
1. Randomly reserves p*100% of the data as validation data, and trains the model using the rest of the data.
2. Stores the compact, trained model in cvMdl.Trained.

Example: 'Holdout',0.3 uses 30% of the data for testing and 70% of the data for training.

If you specify Holdout, then you cannot specify CVPartition, KFold, or Leaveout.

Data Types: single | double
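
The holdout steps above can be sketched as follows, assuming synthetic data with 100 observations and a 25% test fraction. The variable names are hypothetical.

```matlab
rng(0)                                  % for reproducibility
X = linspace(0,10,100)';                % hypothetical predictor
y = sin(X) + 0.1*randn(100,1);          % hypothetical response
gprMdl = fitrgp(X,y);

cvMdl = crossval(gprMdl,'Holdout',0.25);  % reserve 25% of the data for testing

compactMdl = cvMdl.Trained{1};          % the single compact model, trained on the other 75%
numTest = cvMdl.Partition.TestSize;     % number of held-out observations
holdoutMSE = kfoldLoss(cvMdl);          % loss on the held-out observations
```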

Number of folds to use in the cross-validated GPR model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value greater than 1. If you specify 'KFold',k, then crossval:
1. Randomly partitions the data into k sets.
2. For each set, reserves the set as test data, and trains the model using the other k – 1 sets.
3. Stores the k compact, trained models in the cells of a k-by-1 cell array in cvMdl.Trained.

Example: 'KFold',5 uses 5 folds in cross-validation. That is, for each fold, it uses that fold as test data, and trains the model on the remaining 4 folds.

If you specify KFold, then you cannot specify CVPartition, Holdout, or Leaveout.

Data Types: single | double
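
To make the k-fold procedure concrete, the sketch below repeats the per-fold evaluation by hand: for each fold it predicts the reserved observations with the model trained on the other folds and computes the mean squared error. It assumes synthetic data and unit observation weights; the names manualLoss and foldLoss are hypothetical.

```matlab
rng(0)                                  % for reproducibility
X = linspace(0,10,100)';                % hypothetical predictor
y = sin(X) + 0.1*randn(100,1);          % hypothetical response
gprMdl = fitrgp(X,y);

k = 5;
cvMdl = crossval(gprMdl,'KFold',k);

manualLoss = zeros(k,1);
for i = 1:k
    testIdx = test(cvMdl.Partition,i);                % logical index of fold i
    yhat = predict(cvMdl.Trained{i},X(testIdx,:));    % model i never saw fold i
    manualLoss(i) = mean((y(testIdx) - yhat).^2);     % MSE on fold i
end

foldLoss = kfoldLoss(cvMdl,'Mode','individual');      % per-fold loss from the toolbox
```

Under these assumptions, manualLoss and foldLoss should agree up to rounding error.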

Indicator for leave-one-out cross-validation, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. If you specify 'Leaveout','on', then, for each of the n observations, crossval:
1. Reserves the observation as test data, and trains the model using the other n – 1 observations.
2. Stores the n compact, trained models in the cells of an n-by-1 cell array in cvMdl.Trained.

Example: 'Leaveout','on'

If you specify Leaveout, then you cannot specify CVPartition, Holdout, or KFold.
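
A minimal leave-one-out sketch, assuming a deliberately small synthetic data set because one model is trained per observation. The variable names are hypothetical.

```matlab
rng(0)                                  % for reproducibility
n = 20;                                 % keep n small: n models are trained
X = linspace(0,5,n)';                   % hypothetical predictor
y = sin(X) + 0.1*randn(n,1);            % hypothetical response
gprMdl = fitrgp(X,y);

cvMdl = crossval(gprMdl,'Leaveout','on');

numModels = numel(cvMdl.Trained);       % one compact model per observation
looMSE = kfoldLoss(cvMdl);              % average leave-one-out loss
```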

Output Arguments

Partitioned Gaussian process regression model, cvMdl, returned as a RegressionPartitionedModel object.

Examples

Download the housing data [1] from the UCI Machine Learning Repository [4] and save it in your current directory with the name housing.data.

The dataset has 506 observations. The first 13 columns contain the predictor values and the last column contains the response values. The goal is to predict the median value of owner-occupied homes in suburban Boston as a function of 13 predictors.

Load the data and define the response vector and the predictor matrix.

load('housing.data');
X = housing(:,1:13);
y = housing(:,end);

Fit a GPR model using the squared exponential kernel function with a separate length scale for each predictor. Standardize the predictor variables.

gprMdl = fitrgp(X,y,'KernelFunction','ardsquaredexponential','Standardize',1);

Create a cross-validation partition for data using predictor 4 as a grouping variable.

rng('default') % For reproducibility
cvp = cvpartition(X(:,4),'kfold',10);

Create a 10-fold cross-validated model using the partitioned data in cvp.

cvgprMdl = crossval(gprMdl,'CVPartition',cvp);

Compute the regression loss for in-fold observations using models trained on out-of-fold observations.

L = kfoldLoss(cvgprMdl)
L =

    9.5299

Predict the response for in-fold observations, i.e. observations not used for training.

ypred = kfoldPredict(cvgprMdl);

For every fold, kfoldPredict predicts responses for observations in that fold using the models trained on out-of-fold observations.

Plot the actual responses and the predicted responses.

plot(y,'r.');
hold on;
plot(ypred,'b--.');
axis([0 510 -15 65]);
legend('True response','GPR prediction','Location','Best');
hold off;

Download the abalone data [2], [3], from the UCI Machine Learning Repository [4] and save it in your current directory with the name abalone.data.

Read the data into a table.

tbl = readtable('abalone.data','Filetype','text','ReadVariableNames',false);

The dataset has 4177 observations. The goal is to predict the age of abalone from 8 physical measurements.

Fit a GPR model using the subset of regressors (sr) method for parameter estimation and the fully independent conditional (fic) method for prediction. Standardize the predictors and use a squared exponential kernel function with a separate length scale for each predictor.

% readtable names the columns Var1, Var2, ...; the last column holds the response
responseName = tbl.Properties.VariableNames{end};
gprMdl = fitrgp(tbl,responseName,'KernelFunction','ardsquaredexponential',...
      'FitMethod','sr','PredictMethod','fic','Standardize',1);

Cross-validate the model using 4-fold cross-validation. This partitions the data into 4 sets. For each set, crossval uses that set (25% of the data) as the test data, and trains the model on the remaining 3 sets (75% of the data).

rng('default') % For reproducibility
cvgprMdl = crossval(gprMdl,'KFold',4);

Compute the loss over individual folds.

L = kfoldLoss(cvgprMdl,'mode','individual')
L =

    4.3669
    4.6896
    4.0565
    4.3162

Compute the average cross-validated loss over all folds. The default is the mean squared error.

L2 = kfoldLoss(cvgprMdl)
L2 =

    4.3573

This is equal to the mean loss over individual folds.

mse = mean(L)
mse =

    4.3573

Tips

  • You can specify only one of the name-value pair arguments CVPartition, Holdout, KFold, and Leaveout at a time.

  • You cannot compute the prediction intervals for a cross-validated model.

Alternatives

Alternatively, you can train a cross-validated model using the related name-value pair arguments in fitrgp.
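
As a minimal sketch of this alternative, assuming synthetic data, the call below passes 'KFold' directly to fitrgp so that the cross-validated model is produced in one step. The variable names are hypothetical.

```matlab
rng(0)                                  % for reproducibility
X = linspace(0,10,100)';                % hypothetical predictor
y = sin(X) + 0.1*randn(100,1);          % hypothetical response

cvMdl = fitrgp(X,y,'KFold',5);          % returns a cross-validated model directly
L = kfoldLoss(cvMdl);                   % average loss over the 5 folds
```

This route skips the intermediate full model, so use crossval instead when you also need the model trained on all of the data.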

If you supply a custom 'ActiveSet' in the call to fitrgp, then you cannot cross-validate the GPR model.

References

[1] Harrison, D., and D. L. Rubinfeld. "Hedonic prices and the demand for clean air." J. Environ. Economics & Management. Vol. 5, 1978, pp. 81-102.

[2] Nash, W. J., T. L. Sellers, S. R. Talbot, A. J. Cawthorn, and W. B. Ford. "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait." Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288), 1994.

[3] Waugh, S. "Extending and Benchmarking Cascade-Correlation." PhD Thesis. Computer Science Department, University of Tasmania, 1995.

[4] Lichman, M. UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science, 2013. http://archive.ics.uci.edu/ml.

Introduced in R2015b