rankfeatures

Rank key features by class separability criteria

Syntax

[IDX, Z] = rankfeatures(X, Group)
[IDX, Z] = rankfeatures(X, Group, ...'Criterion', CriterionValue, ...)
[IDX, Z] = rankfeatures(X, Group, ...'CCWeighting', ALPHA, ...)
[IDX, Z] = rankfeatures(X, Group, ...'NWeighting', BETA, ...)
[IDX, Z] = rankfeatures(X, Group, ...'NumberOfIndices', N, ...)
[IDX, Z] = rankfeatures(X, Group, ...'CrossNorm', CN, ...)

Description

[IDX, Z] = rankfeatures(X, Group) ranks the features in X using an independent evaluation criterion for binary classification. X is a matrix where every column is an observed vector and the number of rows corresponds to the original number of features. Group contains the class labels.

IDX is the list of indices to the rows in X with the most significant features. Z is the absolute value of the criterion used (see below).

Group can be a numeric vector, a cell array of character vectors, or a string vector. numel(Group) must equal the number of columns in X, and Group must have only two unique values. If Group contains any NaN values, the function ignores the corresponding observation vectors in X.
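The effect of a NaN label can be pictured with a short base-MATLAB sketch. It is an illustration of the rule stated above, not the function's internal code, and the data and variable names (keep, Xused, GroupUsed) are made up for the example.

```matlab
% Illustrative data: 4 features (rows) by 4 observations (columns).
X = magic(4);
Group = [1 2 NaN 1];          % the third observation has no class label

% Observations with a NaN label are left out of the ranking.
keep = ~isnan(Group);
Xused = X(:, keep);           % 4-by-3: the third column is dropped
GroupUsed = Group(keep);      % [1 2 1]

% The remaining labels must still contain exactly two classes.
nClasses = numel(unique(GroupUsed));
```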

[IDX, Z] = rankfeatures(X, Group, ...'PropertyName', PropertyValue, ...) calls rankfeatures with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

[IDX, Z] = rankfeatures(X, Group, ...'Criterion', CriterionValue, ...) sets the criterion used to assess the significance of every feature for separating two labeled groups. Choices are:

  • 'ttest' (default) — Absolute value of the two-sample t-statistic with pooled variance estimate.

  • 'entropy' — Relative entropy, also known as Kullback-Leibler distance or divergence.

  • 'bhattacharyya' — Minimum attainable classification error or Chernoff bound.

  • 'roc' — Area between the empirical receiver operating characteristic (ROC) curve and the random classifier slope.

  • 'wilcoxon' — Absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney.

Note

'ttest', 'entropy', and 'bhattacharyya' assume normally distributed classes, while 'roc' and 'wilcoxon' are nonparametric tests. All tests are feature independent: each feature is evaluated on its own.
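As a rough illustration of what a per-feature criterion looks like, the following base-MATLAB sketch computes an absolute two-sample t-statistic with a pooled variance estimate for every row of a small made-up matrix. It follows the textbook formula and is not necessarily how rankfeatures computes 'ttest' internally; all data and variable names are assumptions for the example.

```matlab
% Made-up data: 2 features (rows) by 6 observations (columns).
X = [1 2 3 7 8 9;      % feature 1: the two groups are well separated
     4 5 6 4 5 6];     % feature 2: both groups have the same values
Group = [0 0 0 1 1 1];

A = X(:, Group == 0);  % observations of the first class
B = X(:, Group == 1);  % observations of the second class
nA = size(A, 2);
nB = size(B, 2);

% Pooled variance estimate, computed separately for each row
sp2 = ((nA - 1) * var(A, 0, 2) + (nB - 1) * var(B, 0, 2)) / (nA + nB - 2);

% Absolute two-sample t-statistic for every feature
Z = abs(mean(A, 2) - mean(B, 2)) ./ sqrt(sp2 * (1/nA + 1/nB));

% Rank features from most to least separable
[~, IDX] = sort(Z, 'descend');
```

Because each row is scored without reference to any other row, two strongly correlated features can both rank highly, which is what the weighting options described next are meant to address.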

[IDX, Z] = rankfeatures(X, Group, ...'CCWeighting', ALPHA, ...) uses correlation information to down-weight the Z value of potential features using Z * (1-ALPHA*(RHO)), where RHO is the average of the absolute values of the cross-correlation coefficients between the candidate feature and all previously selected features. ALPHA sets the weighting factor and is a scalar value between 0 and 1. When ALPHA is 0 (default), potential features are not weighted. A large value of RHO (close to 1) reduces the significance statistic, so features that are highly correlated with the features already picked are less likely to be included in the output list.
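A small numeric sketch may help show how the correlation weighting changes the ordering of candidates. The Z and RHO values below are invented for illustration; only the formula Z * (1-ALPHA*RHO) comes from the description above.

```matlab
% Hypothetical significance values for three candidate features
Z     = [5.0 4.8 3.0];
% Hypothetical mean absolute correlation with the features already selected
RHO   = [0.9 0.1 0.0];
ALPHA = 0.7;

% Weighted significance: highly correlated candidates are penalized
Zw = Z .* (1 - ALPHA * RHO);

% Candidate with the largest weighted significance
[~, best] = max(Zw);
```

In this example the first candidate has the largest raw Z, but its high correlation with already selected features lowers its weighted value below that of the second candidate.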

[IDX, Z] = rankfeatures(X, Group, ...'NWeighting', BETA, ...) uses regional information to down-weight the Z value of potential features using Z * (1-exp(-(DIST/BETA).^2)), where DIST is the distance (in rows) between the candidate feature and previously selected features. BETA sets the weighting factor and must be greater than or equal to 0. When BETA is 0 (default), potential features are not weighted. A small DIST (close to 0) reduces the significance statistic of nearby features only, so features that are close to already picked features are less likely to be included in the output list. This option is useful for extracting features from time series with temporal correlation.

BETA can also be a function of the feature location, specified using @ or an anonymous function. In both cases rankfeatures passes the row position of the feature to BETA() and expects back a value greater than or equal to 0.
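The shape of the regional weighting factor can be explored with a few lines of base MATLAB. This is a sketch of the formula 1-exp(-(DIST/BETA).^2) only; the helper names weight and betaFun and the sample distances are assumptions made for the example.

```matlab
% Weighting factor applied to Z, as a function of distance and BETA
weight = @(DIST, BETA) 1 - exp(-(DIST ./ BETA).^2);

% Constant BETA: factor for candidates 0, 1, 5, and 20 rows away
DIST = [0 1 5 20];
w = weight(DIST, 5);

% BETA of 0 with a nonzero distance gives a factor of 1 (no weighting)
wNoWeighting = weight(3, 0);

% BETA given as a function of row position, in the style of @(x) x/10+5
betaFun = @(row) row/10 + 5;
betaAt50 = betaFun(50);              % BETA evaluated at row 50
wRow50 = weight(3, betaAt50);        % factor for a candidate 3 rows away
```

The factor is 0 at distance 0 and approaches 1 as the distance grows well beyond BETA, so a larger BETA suppresses candidates over a wider neighborhood.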

Note

You can use 'CCWeighting' and 'NWeighting' together.

[IDX, Z] = rankfeatures(X, Group, ...'NumberOfIndices', N, ...) sets the number of output indices in IDX. Default is the same as the number of features when ALPHA and BETA are 0, or 20 otherwise.

[IDX, Z] = rankfeatures(X, Group, ...'CrossNorm', CN, ...) applies independent normalization across the observations for every feature. Cross-normalization ensures comparability among different features, although it is not always necessary because the selected criterion might already account for this. Choices are:

  • 'none' (default) — Intensities are not cross-normalized.

  • 'meanvar' — x_new = (x - mean(x))/std(x)

  • 'softmax' — x_new = (1+exp((mean(x)-x)/std(x)))^-1

  • 'minmax' — x_new = (x - min(x))/(max(x)-min(x))
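The three normalization formulas can be tried on a single feature with base MATLAB. This sketch assumes MATLAB's default std (normalized by N-1) and uses made-up values; whether rankfeatures uses the same normalization internally is not stated here. Note that the element-wise power .^ is needed when x is a vector.

```matlab
% One feature measured across eight observations (made-up values)
x = [2 4 4 4 5 5 7 9];

% 'meanvar': zero mean and unit standard deviation
xMeanvar = (x - mean(x)) / std(x);

% 'softmax': squashes values into the open interval (0, 1)
xSoftmax = (1 + exp((mean(x) - x) / std(x))).^-1;

% 'minmax': rescales values to the closed interval [0, 1]
xMinmax = (x - min(x)) / (max(x) - min(x));
```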

Examples

Find a reduced set of genes that is sufficient for differentiating breast cancer cells from all other types of cancer in the t-matrix NCI60 data set.

Load sample data.

load NCI60tmatrix

Get a logical index vector to the breast cancer cells.

BC = GROUP == 8;

Select features.

I = rankfeatures(X,BC,'NumberOfIndices',12);

Test features with a linear discriminant classifier.

C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
ans =

     1

Use cross-correlation weighting to further reduce the required number of genes.

I = rankfeatures(X,BC,'CCWeighting',0.7,'NumberOfIndices',8);
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate 
ans =

     1

Find the discriminant peaks of two groups of signals with Gaussian pulses modulated by two different sources.

load GaussianPulses
f = rankfeatures(y',grp,'NWeighting',@(x) x/10+5,'NumberOfIndices',5);
plot(t,y(grp==1,:),'b',t,y(grp==2,:),'g',t(f),1.35,'vr')

References

[1] Theodoridis, S., and Koutroumbas, K. (1999). Pattern Recognition, Academic Press, 341-342.

[2] Liu, H., and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers.

[3] Ross, D.T., et al. (2000). Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines. Nature Genetics, 24(3), 227-235.

Introduced before R2006a