sequentialfs

Sequential feature selection using custom criterion

Syntax

inmodel = sequentialfs(fun,X,y)
inmodel = sequentialfs(fun,X,Y,Z,...)
[inmodel,history] = sequentialfs(fun,X,...)
[] = sequentialfs(...,param1,val1,param2,val2,...)

Description

inmodel = sequentialfs(fun,X,y) selects a subset of features from the data matrix X that best predict the data in y by sequentially selecting features until there is no improvement in prediction. Rows of X correspond to observations; columns correspond to variables or features. y is a column vector of response values or class labels for each observation in X. X and y must have the same number of rows. fun is a function handle to a function that defines the criterion used to select features and to determine when to stop. The output inmodel is a logical vector indicating which features are finally chosen.

Starting from an empty feature set, sequentialfs creates candidate feature subsets by sequentially adding each of the features not yet selected. For each candidate feature subset, sequentialfs performs 10-fold cross-validation by repeatedly calling fun with different training subsets of X and y, XTRAIN and ytrain, and test subsets of X and y, XTEST and ytest, as follows:

criterion = fun(XTRAIN,ytrain,XTEST,ytest)

XTRAIN and ytrain contain the same subset of rows of X and y, while XTEST and ytest contain the complementary subset of rows. XTRAIN and XTEST contain only the columns of X that correspond to the current candidate feature set.

Each time it is called, fun must return a scalar value criterion. Typically, fun uses XTRAIN and ytrain to train or fit a model, then predicts values for XTEST using that model, and finally returns some measure of distance, or loss, of those predicted values from ytest. In the cross-validation calculation for a given candidate feature set, sequentialfs sums the values returned by fun and divides that sum by the total number of test observations. It then uses that mean value to evaluate each candidate feature subset.
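The averaging described above can be sketched by hand for one candidate feature set. This is an illustration only, not the internal implementation of sequentialfs: the synthetic data, the candidate column set cols, and the no-intercept least-squares criterion are all assumptions made for the example.

```matlab
rng(0)                                   % fixed seed, for repeatability only
X = randn(60,4);                         % synthetic data (assumption)
y = 2*X(:,1) - X(:,3) + 0.1*randn(60,1);
cols = [1 3];                            % hypothetical candidate feature set

% Criterion: sum of squared errors of a least-squares fit without intercept.
% It returns a sum, not a mean.
fun = @(XTRAIN,ytrain,XTEST,ytest) sum((ytest - XTEST*(XTRAIN\ytrain)).^2);

c = cvpartition(size(X,1),'KFold',10);   % 10 folds, no stratification
total = 0;
for k = 1:c.NumTestSets
    tr = training(c,k);                  % logical index of training rows
    te = test(c,k);                      % logical index of test rows
    total = total + fun(X(tr,cols),y(tr),X(te,cols),y(te));
end

% Divide the summed criterion by the total number of test observations
meanCrit = total / sum(c.TestSize);
```

Because fun returns a sum of squared errors, meanCrit here is a cross-validated mean squared error for the candidate set.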

Typical loss measures include sum of squared errors for regression models (sequentialfs computes the mean-squared error in this case), and the number of misclassified observations for classification models (sequentialfs computes the misclassification rate in this case).

Note

sequentialfs divides the sum of the values returned by fun across all test sets by the total number of test observations. Accordingly, fun should not divide its output value by the number of test observations.

After computing the mean criterion values for each candidate feature subset, sequentialfs chooses the candidate feature subset that minimizes the mean criterion value. This process continues until adding more features does not decrease the criterion.
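As a minimal sketch of a classification criterion of the kind described above, the function below returns the number of misclassified test observations, leaving the division by the number of test observations to sequentialfs. The choice of fitcdiscr as the classifier is an assumption made for illustration; any model with a predict method could be used the same way.

```matlab
load fisheriris                          % meas (150x4), species (150x1 cell)
rng(1)                                   % the default 10-fold partition is random

% Criterion: count of misclassified test observations (a sum, not a rate)
fun = @(XTRAIN,ytrain,XTEST,ytest) ...
    sum(~strcmp(ytest, predict(fitcdiscr(XTRAIN,ytrain),XTEST)));

inmodel = sequentialfs(fun,meas,species);   % logical vector, one entry per column of meas
selected = find(inmodel);                   % column numbers of the chosen features
```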

inmodel = sequentialfs(fun,X,Y,Z,...) allows any number of input variables X, Y, Z, ... . sequentialfs chooses features (columns) only from X, but otherwise imposes no interpretation on X, Y, Z, ... . All data inputs, whether column vectors or matrices, must have the same number of rows. sequentialfs calls fun with training and test subsets of X, Y, Z, ... as follows:

criterion = fun(XTRAIN,YTRAIN,ZTRAIN,...,
                XTEST,YTEST,ZTEST,...)

sequentialfs creates XTRAIN, YTRAIN, ZTRAIN, ... , XTEST, YTEST, ZTEST, ... by selecting subsets of the rows of X, Y, Z, ... . fun must return a scalar value criterion, but may compute that value in any way. Elements of the logical vector inmodel correspond to columns of X and indicate which features are finally chosen.
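One possible use of a third data input is a vector of observation weights. The sketch below assumes a weighted least-squares fit through lscov and a weighted sum of squared errors as the criterion; the data and the weights w are synthetic and purely illustrative.

```matlab
rng(2)                                   % fixed seed, for repeatability only
n = 80;
X = randn(n,5);
y = X(:,1) + 0.5*X(:,4) + 0.1*randn(n,1);
w = rand(n,1) + 0.5;                     % hypothetical positive observation weights

% fun receives the training subsets first, then the test subsets,
% in the same order as the data inputs X, y, w.
fun = @(XTRAIN,ytrain,wtrain,XTEST,ytest,wtest) ...
    sum(wtest .* (ytest - XTEST*lscov(XTRAIN,ytrain,wtrain)).^2);

inmodel = sequentialfs(fun,X,y,w);       % features are chosen from X only
```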

[inmodel,history] = sequentialfs(fun,X,...) returns information on which feature is chosen at each step. history is a scalar structure with the following fields:

Crit — A vector containing the criterion values computed at each step.
In — A logical matrix in which row i indicates the features selected at step i.

[] = sequentialfs(...,param1,val1,param2,val2,...) specifies optional parameter name/value pairs from the following table.

Parameter	Value
`'cv'`	The validation method used to compute the criterion for each candidate feature subset. When the value is a positive integer `k`, `sequentialfs` uses `k`-fold cross-validation without stratification. When the value is an object of the `cvpartition` class, other forms of cross-validation can be specified. When the value is `'resubstitution'`, the original data are passed to `fun` as both the training and test data to compute the criterion. When the value is `'none'`, `sequentialfs` calls `fun` as `criterion = fun(X,Y,Z,...)`, without separating test and training sets. The default value is `10`, that is, 10-fold cross-validation without stratification. So-called wrapper methods use a function `fun` that implements a learning algorithm. These methods usually apply cross-validation to select features. So-called filter methods use a function `fun` that measures characteristics of the data (such as correlation) to select features.
`'mcreps'`	A positive integer indicating the number of Monte-Carlo repetitions for cross-validation. The default value is `1`. The value must be `1` if the value of `'cv'` is `'resubstitution'` or `'none'`.
`'direction'`	The direction of the sequential search. The default is `'forward'`. A value of `'backward'` specifies an initial candidate set including all features and an algorithm that removes features sequentially until the criterion increases.
`'keepin'`	A logical vector or a vector of column numbers specifying features that must be included. The default is empty.
`'keepout'`	A logical vector or a vector of column numbers specifying features that must be excluded. The default is empty.
`'nfeatures'`	The number of features at which `sequentialfs` should stop. `inmodel` includes exactly this many features. The default value is empty, indicating that `sequentialfs` should stop when a local minimum of the criterion is found. A nonempty value overrides values of `'MaxIter'` and `'TolFun'` in `'options'`.
`'nullmodel'`	A logical value, indicating whether or not the null model (containing no features from `X`) should be included in feature selection and in the `history` output. The default is `false`.
`'options'`	Options structure for the iterative sequential search algorithm, as created by `statset`. `sequentialfs` uses the following `statset` parameters: `Display` — Amount of information displayed by the algorithm. The default is `'off'`. `MaxIter` — Maximum number of iterations allowed. The default is `Inf`. `TolFun` — Termination tolerance for the objective function value. The default is `1e-6` if `'direction'` is `'forward'`; `0` if `'direction'` is `'backward'`. `TolTypeFun` — Use absolute or relative objective function tolerances. The default is `'rel'`. `UseParallel` — Set to `true` to compute in parallel. Default is `false`. `UseSubstreams` — Set to `true` to compute in parallel in a reproducible fashion. Default is `false`. To compute reproducibly, set `Streams` to a type allowing substreams: `'mlfg6331_64'` or `'mrg32k3a'`. `Streams` — A `RandStream` object or cell array consisting of one such object. If you do not specify `Streams`, `sequentialfs` uses the default stream. To compute in parallel, you need Parallel Computing Toolbox™.
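The following sketch combines several of the parameters in the table. The data and criterion functions are illustrative assumptions. The first call runs a backward search that always keeps column 1 and stops at three features. The second call uses 'cv','none', so fun receives only the candidate columns and the response, in the style of a filter method.

```matlab
rng(3)                                   % fixed seed, for repeatability only
X = randn(100,6);
y = 3*X(:,2) - 2*X(:,5) + 0.1*randn(100,1);

% Wrapper-style criterion: sum of squared test errors of a least-squares fit
fun = @(XTRAIN,ytrain,XTEST,ytest) sum((ytest - XTEST*(XTRAIN\ytrain)).^2);

% Backward search, 5-fold cross-validation, column 1 forced in, stop at 3 features
[inmodel,history] = sequentialfs(fun,X,y, ...
    'direction','backward','keepin',1,'nfeatures',3,'cv',5);

% No train/test split: fun is called as fun(X,y) on the candidate columns
funNone = @(Xc,yc) sum((yc - Xc*(Xc\yc)).^2);
inmodelNone = sequentialfs(funNone,X,y,'cv','none','nfeatures',2);
```

Setting 'nfeatures' in both calls fixes the size of the result, so the search does not rely on the tolerance-based stopping rule.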

Examples

Perform sequential feature selection for classification of noisy features:

load fisheriris
rng('default') % For reproducibility
X = randn(150,10);
X(:,[1 3 5 7])= meas;
y = species;

c = cvpartition(y,'k',10);
opts = statset('Display','iter');
fun = @(XT,yT,Xt,yt)loss(fitcecoc(XT,yT),Xt,yt);

[fs,history] = sequentialfs(fun,X,y,'cv',c,'options',opts)

Start forward sequential feature selection:
Initial columns included:  none
Columns that can not be included:  none
Step 1, added column 5, criterion value 0.00266667
Step 2, added column 7, criterion value 0.00222222
Step 3, added column 1, criterion value 0.00177778
Step 4, added column 3, criterion value 0.000888889
Final columns included:  1 3 5 7 

fs =

  1×10 logical array

   1   0   1   0   1   0   1   0   0   0


history = 

  struct with fields:

      In: [4×10 logical]
    Crit: [0.0027 0.0022 0.0018 8.8889e-04]

history.In

ans =

  4×10 logical array

   0   0   0   0   1   0   0   0   0   0
   0   0   0   0   1   0   1   0   0   0
   1   0   0   0   1   0   1   0   0   0
   1   0   1   0   1   0   1   0   0   0
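The order in which features were added can be recovered from history.In by comparing each row with the previous one. The sketch below copies the matrix displayed above into a variable so that it runs on its own. It assumes a forward search in which each step adds exactly one feature and the null model is not included.

```matlab
% history.In from the example above, entered by hand
In = logical([0 0 0 0 1 0 0 0 0 0
              0 0 0 0 1 0 1 0 0 0
              1 0 0 0 1 0 1 0 0 0
              1 0 1 0 1 0 1 0 0 0]);

% Row i of prev is the feature set before step i (empty before step 1)
prev = [false(1,size(In,2)); In(1:end-1,:)];

% The newly added feature is the one present in In but not in prev
[~,addedCol] = max(double(In & ~prev),[],2);   % addedCol is [5; 7; 1; 3]
```

The result matches the columns reported in the iterative display: 5, 7, 1, then 3.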

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To run in parallel, set the 'UseParallel' option to true.

Set the 'UseParallel' field of the options structure to true using statset and specify the 'Options' name-value pair argument in the call to this function.

For example: 'Options',statset('UseParallel',true)

For more information, see the 'Options' name-value pair argument.

For more general information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
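A complete call with parallel options might look like the sketch below. It assumes Parallel Computing Toolbox is installed. The classifier (fitcdiscr) and the misclassification-count criterion are illustrative choices, and the stream type is one of the two substream-capable types named in the 'options' description.

```matlab
load fisheriris

% Substream-capable random stream, for reproducible parallel computation
s = RandStream('mlfg6331_64');
opts = statset('UseParallel',true,'UseSubstreams',true,'Streams',s);

% Criterion: number of misclassified test observations
fun = @(XT,yT,Xt,yt) sum(~strcmp(yt, predict(fitcdiscr(XT,yT),Xt)));

inmodel = sequentialfs(fun,meas,species,'options',opts);
```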
