crossval

Cross-validate Gaussian process regression model

Syntax

cvgprMdl = crossval(gprMdl)
cvgprMdl = crossval(gprMdl,Name,Value)

Description

cvgprMdl = crossval(gprMdl) returns the partitioned model, cvgprMdl, built from the Gaussian process regression (GPR) model, gprMdl, using 10-fold cross-validation.

cvgprMdl is a RegressionPartitionedModel object, and gprMdl is a RegressionGP (full) object.

cvgprMdl = crossval(gprMdl,Name,Value) returns the partitioned model, cvgprMdl, with additional options specified by one or more Name,Value pair arguments. For example, you can specify the number of folds or the fraction of the data to use for testing.
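As a minimal sketch of the two syntaxes, the following fits a GPR model to synthetic data and cross-validates it twice. The data and the variable names cvDefault and cvFive are hypothetical and are not part of the examples on this page.

```matlab
% Hypothetical synthetic data, used only for illustration
rng(0);                                   % for reproducibility
X = linspace(0,10,80)';                   % 80-by-1 predictor
y = sin(X) + 0.1*randn(80,1);             % noisy response

gprMdl = fitrgp(X,y);                     % full RegressionGP model

cvDefault = crossval(gprMdl);             % first syntax: default 10-fold cross-validation
cvFive    = crossval(gprMdl,'KFold',5);   % second syntax: name-value pair sets 5 folds

disp(cvDefault.KFold)                     % number of folds in the default model
disp(numel(cvFive.Trained))               % number of compact models stored (one per fold)
```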

Input Arguments

`gprMdl` — Gaussian process regression model
`RegressionGP` object

Gaussian process regression model, specified as a RegressionGP (full) object. You cannot call crossval on a compact regression object.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

`'CVPartition'` — Random partition for a k-fold cross validation
`cvpartition` object

Random partition for a k-fold cross validation, specified as the comma-separated pair consisting of 'CVPartition' and a cvpartition object.

Example: 'CVPartition',cvp uses the random partition defined by cvp.

If you specify CVPartition, then you cannot specify Holdout, KFold, or Leaveout.

`'Holdout'` — Fraction of data to use for testing
scalar value in the range from 0 to 1

Fraction of the data to use for testing in holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range from 0 to 1. If you specify 'Holdout',p, then crossval:
1. Randomly reserves p*100% of the data as validation data, and trains the model using the rest of the data.
2. Stores the compact, trained model in cvgprMdl.Trained.

Example: 'Holdout',0.3 uses 30% of the data for testing and 70% of the data for training.

If you specify Holdout, then you cannot specify CVPartition, KFold, or Leaveout.

Data Types: single | double
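The holdout behavior described above can be sketched as follows. The synthetic data and the variable names compactMdl, nTest, nTrain, and holdoutMSE are assumptions made for illustration only.

```matlab
% Hypothetical synthetic data, used only for illustration
rng(0);                                      % for reproducibility
X = rand(100,2);
y = X(:,1) - 2*X(:,2) + 0.05*randn(100,1);
gprMdl = fitrgp(X,y);

% Reserve 30% of the observations for validation
cvgprMdl = crossval(gprMdl,'Holdout',0.3);

compactMdl = cvgprMdl.Trained{1};            % the single compact, trained model
nTest  = sum(test(cvgprMdl.Partition));      % observations reserved for validation
nTrain = sum(training(cvgprMdl.Partition));  % observations used for training
holdoutMSE = kfoldLoss(cvgprMdl);            % mean squared error on the reserved data
```

Here kfoldLoss evaluates the one stored model on the reserved observations, so the result is a single holdout error estimate rather than an average over folds.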

`'KFold'` — Number of folds
10 (default) | positive integer value greater than 1

Number of folds to use in the cross-validated GPR model, specified as the comma-separated pair consisting of 'KFold' and a positive integer value greater than 1. If you specify 'KFold',k, then crossval:
1. Randomly partitions the data into k sets.
2. For each set, reserves the set as test data, and trains the model using the other k – 1 sets.
3. Stores the k compact, trained models in the cells of a k-by-1 cell array in cvgprMdl.Trained.

Example: 'KFold',5 uses 5 folds in cross-validation. That is, for each fold, it uses that fold as test data, and trains the model on the remaining 4 folds.

If you specify KFold, then you cannot specify CVPartition, Holdout, or Leaveout.

Data Types: single | double
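The three k-fold steps above can be reproduced by hand. The loop below is a sketch of the procedure, not the internal implementation of crossval: it refits each fold with default fitrgp options, and the data and the variable names trained and foldMSE are hypothetical.

```matlab
% Hypothetical synthetic data, used only for illustration
rng(0);                                       % for reproducibility
X = rand(90,2);
y = X(:,1) - 2*X(:,2) + 0.05*randn(90,1);

k = 3;
cvp = cvpartition(numel(y),'KFold',k);        % step 1: random partition into k sets
trained = cell(k,1);                          % k-by-1 cell array of compact models
foldMSE = zeros(k,1);

for i = 1:k
    trainIdx = training(cvp,i);               % logical index of the other k-1 sets
    testIdx  = test(cvp,i);                   % logical index of the reserved set

    mdl = fitrgp(X(trainIdx,:),y(trainIdx));  % step 2: train on the k-1 sets
    trained{i} = compact(mdl);                % step 3: keep the compact model

    yhat = predict(trained{i},X(testIdx,:));  % predict the reserved set
    foldMSE(i) = mean((y(testIdx) - yhat).^2);
end

disp(foldMSE)                                 % per-fold mean squared error
```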

`'Leaveout'` — Indicator for leave-one-out cross-validation
`'off'` (default) | `'on'`

Indicator for leave-one-out cross-validation, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. If you specify 'Leaveout','on', then, for each of the n observations, crossval:
1. Reserves the observation as test data, and trains the model using the other n – 1 observations.
2. Stores the n compact, trained models in the cells of an n-by-1 cell array in cvgprMdl.Trained.

Example: 'Leaveout','on'

If you specify Leaveout, then you cannot specify CVPartition, Holdout, or KFold.
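A small leave-one-out sketch, assuming a deliberately tiny synthetic data set because n models must be trained. The data and the variable names nModels and ypred are hypothetical.

```matlab
% Hypothetical synthetic data, kept small because n models are trained
rng(0);                                        % for reproducibility
X = rand(25,1);
y = 3*X + 0.05*randn(25,1);
gprMdl = fitrgp(X,y);

cvgprMdl = crossval(gprMdl,'Leaveout','on');   % one fold per observation

nModels = numel(cvgprMdl.Trained);             % one compact model per observation
ypred   = kfoldPredict(cvgprMdl);              % each prediction comes from the model
                                               % trained without that observation
```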

Output Arguments

`cvgprMdl` — Partitioned Gaussian process regression model
`RegressionPartitionedModel` object

Partitioned Gaussian process regression model, returned as a RegressionPartitionedModel object.

Examples

Partition Data for Cross-Validation

Download the housing data [1] from the UCI Machine Learning Repository [4] and save it in your current directory with the name housing.data.

The dataset has 506 observations. The first 13 columns contain the predictor values and the last column contains the response values. The goal is to predict the median value of owner-occupied homes in suburban Boston as a function of 13 predictors.

Load the data and define the response vector and the predictor matrix.

load('housing.data');
X = housing(:,1:13);
y = housing(:,end);

Fit a GPR model using the squared exponential kernel function with separate length scale for each predictor. Standardize the predictor variables.

gprMdl = fitrgp(X,y,'KernelFunction','ardsquaredexponential','Standardize',1);

Create a stratified 10-fold cross-validation partition of the data, using predictor 4 as the grouping variable.

rng('default') % For reproducibility
cvp = cvpartition(X(:,4),'kfold',10);

Create a 10-fold cross-validated model using the partitioned data in cvp.

cvgprMdl = crossval(gprMdl,'CVPartition',cvp);

Compute the regression loss for in-fold observations using models trained on out-of-fold observations.

L = kfoldLoss(cvgprMdl)

L =

    9.5299

Predict the response for in-fold observations, i.e. observations not used for training.

ypred = kfoldPredict(cvgprMdl);

For every fold, kfoldPredict predicts responses for observations in that fold using the models trained on out-of-fold observations.

Plot the true responses and the cross-validated predictions.

plot(y,'r.');
hold on;
plot(ypred,'b--.');
axis([0 510 -15 65]);
legend('True response','GPR prediction','Location','Best');
hold off;

Train GPR Model Using 4-Fold Cross Validation

Download the abalone data [2], [3], from the UCI Machine Learning Repository [4] and save it in your current directory with the name abalone.data.

Read the data into a table.

tbl = readtable('abalone.data','Filetype','text','ReadVariableNames',false);

The dataset has 4177 observations. The goal is to predict the age of abalone from 8 physical measurements.

Fit a GPR model using the subset of regressors (sr) method for parameter estimation and fully independent conditional (fic) method for prediction. Standardize the predictors and use a squared exponential kernel function with a separate length scale for each predictor.

% readtable names the nine columns Var1,...,Var9; the last column, Var9, is the response.
gprMdl = fitrgp(tbl,'Var9','KernelFunction','ardsquaredexponential',...
      'FitMethod','sr','PredictMethod','fic','Standardize',1);

Cross-validate the model using 4-fold cross-validation. This partitions the data into 4 sets. For each set, crossval uses that set (25% of the data) as the test data, and trains the model on the remaining 3 sets (75% of the data).

rng('default') % For reproducibility
cvgprMdl = crossval(gprMdl,'KFold',4);

Compute the loss over individual folds.

L = kfoldLoss(cvgprMdl,'mode','individual')

Compute the average cross-validated loss over all folds. The default loss is the mean squared error.

L2 = kfoldLoss(cvgprMdl)

L2 =

    4.3573

This is equal to the mean loss over individual folds.

mse = mean(L)

mse =

    4.3573

Tips

You can use only one of the name-value pair arguments at a time.
You cannot compute the prediction intervals for a cross-validated model.
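If you need prediction intervals, one option is to request them from the full model rather than the cross-validated one. The sketch below assumes synthetic data and hypothetical query points Xnew.

```matlab
% Hypothetical synthetic data, used only for illustration
rng(0);                                    % for reproducibility
X = linspace(0,10,50)';
y = sin(X) + 0.1*randn(50,1);
gprMdl = fitrgp(X,y);                      % full (not cross-validated) model

Xnew = [2.5; 7.5];                         % hypothetical query points
[ypred,ysd,yint] = predict(gprMdl,Xnew);   % yint holds the lower and upper limits
                                           % of the 95% prediction intervals (default)
```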

Alternatives

Alternatively, you can train a cross-validated model using the related name-value pair arguments in fitrgp.
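For instance, passing 'KFold' directly to fitrgp returns a cross-validated model in one call. This is a minimal sketch on hypothetical synthetic data.

```matlab
% Hypothetical synthetic data, used only for illustration
rng(0);                                % for reproducibility
X = rand(80,2);
y = X(:,1) - 2*X(:,2) + 0.05*randn(80,1);

cvgprMdl = fitrgp(X,y,'KFold',5);      % returns a cross-validated model directly
L = kfoldLoss(cvgprMdl);               % average mean squared error over the 5 folds
```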

If you supply a custom 'ActiveSet' in the call to fitrgp, then you cannot cross-validate the GPR model.

References

[1] Harrison, D., and D. L. Rubinfeld. "Hedonic prices and the demand for clean air." J. Environ. Economics & Management. Vol. 5, 1978, pp. 81-102.

[2] Nash, W. J., T. L. Sellers, S. R. Talbot, A. J. Cawthorn, and W. B. Ford. "The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait." Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288), 1994.

[3] Waugh, S. "Extending and Benchmarking Cascade-Correlation." PhD Thesis. Computer Science Department, University of Tasmania, 1995.

[4] Lichman, M. UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science, 2013. http://archive.ics.uci.edu/ml.
