Feature extraction is a set of methods that map input features to new output features. Many feature extraction methods use unsupervised learning to extract features. Unlike some feature extraction methods such as PCA and NNMF, the methods described in this section can increase dimensionality as well as decrease it. Internally, the methods involve optimizing nonlinear objective functions. For details, see Sparse Filtering Algorithm or Reconstruction ICA Algorithm.
One typical use of feature extraction is finding features in images. Using these features can lead to improved classification accuracy. For an example, see Feature Extraction Workflow. Another typical use is extracting individual signals from superpositions, which is often termed blind source separation. For an example, see Extract Mixed Signals.
There are two feature extraction functions: rica and sparsefilt. Associated with these functions are the objects that they create: ReconstructionICA and SparseFiltering.
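A minimal sketch of the calling pattern, using synthetic data and illustrative variable names:

rng default
X = randn(100,10);            % 100 observations of 10 original features
q = 5;                        % number of features to extract

sfMdl  = sparsefilt(X,q);     % returns a SparseFiltering object
icaMdl = rica(X,q);           % returns a ReconstructionICA object

Xsf  = transform(sfMdl,X);    % 100-by-5 matrix of sparse filtering features
Xica = transform(icaMdl,X);   % 100-by-5 matrix of RICA features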
The sparse filtering algorithm begins with a data matrix X that has n rows and p columns. Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then either starts from an initial random p-by-q weight matrix W or uses the weight matrix passed in the InitialTransformWeights name-value pair. q is the requested number of features that sparsefilt computes.
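For example, you can supply your own starting weights (the data and values here are synthetic and purely illustrative):

X  = randn(200,8);       % n = 200 observations, p = 8 features
q  = 4;                  % requested number of output features
W0 = randn(8,q);         % an initial p-by-q weight matrix
Mdl = sparsefilt(X,q,'InitialTransformWeights',W0);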
The algorithm attempts to minimize the Sparse Filtering Objective Function by using a standard limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to IterationLimit iterations. It stops iterating earlier when it takes a step whose norm is less than StepTolerance, or when it computes that the norm of the gradient at the current point is less than GradientTolerance times a scalar τ, where

τ = max(1, min(|f|, ‖g0‖∞)),

|f| is the norm of the objective function, and ‖g0‖∞ is the infinity norm of the initial gradient g0.
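As a sketch, you can control these stopping criteria through the corresponding name-value pairs (the values shown are illustrative, not recommendations):

X = randn(500,16);
Mdl = sparsefilt(X,8, ...
    'IterationLimit',200, ...      % cap on LBFGS iterations
    'StepTolerance',1e-6, ...      % stop when the step norm falls below this value
    'GradientTolerance',1e-4);     % stop when the scaled gradient norm falls below this value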
The objective function attempts to simultaneously obtain few nonzero features for each data point and nearly equal weight for each resulting feature. To understand how the objective function attempts to achieve these goals, see Ngiam, Koh, Chen, Bhaskar, and Ng [1].
Frequently, you obtain good features by setting a relatively small value of IterationLimit, from as low as 5 to a few hundred. Allowing the optimizer to continue can result in overtraining, where the extracted features do not generalize well to new data.
After constructing a SparseFiltering object, use the transform method to map input data to the new output features.
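A minimal sketch of this workflow with synthetic data:

X   = randn(300,20);
Mdl = sparsefilt(X,10,'IterationLimit',20);   % deliberately few iterations
Y   = transform(Mdl,X);                       % 300-by-10 matrix of extracted features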
To compute an objective function, the sparse filtering algorithm uses the following steps. The objective function depends on the n-by-p data matrix X and a weight matrix W that the optimizer varies. The weight matrix W has dimensions p-by-q, where p is the number of original features and q is the number of requested features.
Compute the n-by-q matrix X*W.
Apply the approximate absolute value function ϕ(u) = √(u² + 10⁻⁸) to each element of X*W to obtain the matrix F. ϕ is a smooth nonnegative symmetric function that closely approximates the absolute value function.
Normalize the columns of F by the approximate L2 norm. In other words, define the normalized matrix F̃ by

F̃(i,j) = F(i,j) / √(Σi F(i,j)² + 10⁻⁸).

Normalize the rows of F̃ by the approximate L2 norm. In other words, define the normalized matrix F̂ by

F̂(i,j) = F̃(i,j) / √(Σj F̃(i,j)² + 10⁻⁸).
The matrix F̂ is the matrix of converted features in X. Once sparsefilt finds the weights W that minimize the objective function h (see below), it stores them in the Mdl.TransformWeights property of the output object Mdl. The transform function can then follow the same transformation steps to convert new data to output features.
Compute the objective function h(W) as the 1-norm of the matrix F̂, meaning the sum of all the elements in the matrix (which are nonnegative by construction):

h(W) = Σi Σj F̂(i,j),

where the sums run over i = 1,…,n and j = 1,…,q.
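The steps above can be written out directly. The sketch below assumes the smoothing constant 10⁻⁸ used in the formulas above and an arbitrary (not optimized) weight matrix W:

X = randn(50,6);   W = randn(6,3);   e = 1e-8;   % e is the assumed smoothing constant

F    = sqrt((X*W).^2 + e);                   % soft absolute value, applied elementwise
Ftil = F ./ sqrt(sum(F.^2,1) + e);           % normalize each column by its approximate L2 norm
Fhat = Ftil ./ sqrt(sum(Ftil.^2,2) + e);     % then normalize each row
h    = sum(Fhat(:));                         % objective value: sum of all (nonnegative) entries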
If you set the Lambda name-value pair to a strictly positive value, sparsefilt uses the following modified objective function:

h(W) + λ Σj wjᵀ wj, with j = 1,…,q.

Here, wj is the jth column of the matrix W and λ is the value of Lambda. The effect of this term is to shrink the weights W. If you plot the columns of W as images, with positive Lambda these images appear smooth compared to the same images with zero Lambda.
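A sketch of this comparison (the Lambda value is illustrative):

X    = randn(400,25);
Mdl0 = sparsefilt(X,9,'Lambda',0);      % unregularized weights
Mdl1 = sparsefilt(X,9,'Lambda',5e-3);   % positive Lambda shrinks the weights
% For image data, reshape and display the columns of Mdl0.TransformWeights
% and Mdl1.TransformWeights to compare their smoothness.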
The Reconstruction Independent Component Analysis (RICA) algorithm is based on minimizing an objective function. The algorithm maps input data to output features.
The ICA source model is the following. Each observation x is generated by a random vector s according to

x = μ + As

where:

x is a column vector of length p.

μ is a column vector of length p representing a constant term.

s is a column vector of length q whose elements are zero mean, unit variance random variables that are statistically independent of each other.

A is a mixing matrix of size p-by-q.
You can use this model in rica to estimate A from observations of x. See Extract Mixed Signals.
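A sketch of this source model and of estimating the transformation with rica (the sources and mixing matrix here are synthetic):

n  = 1000;
t  = (1:n)';
S  = [sin(t/5), sign(sin(t/11))];     % two (approximately) independent zero-mean sources
A  = [1 2; -1 1; 0.5 0.3];            % 3-by-2 mixing matrix (p = 3, q = 2)
mu = [0.1 -0.2 0.3];                  % constant term
X  = S*A' + mu;                       % each row is one observation x = mu + A*s

Mdl  = rica(X,2);                     % request q = 2 features
Shat = transform(Mdl,X);              % recovered sources, up to order and scale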
The RICA algorithm begins with a data matrix X that has n rows and p columns consisting of the observations xi:

X = [x1ᵀ; x2ᵀ; … ; xnᵀ],

where row i of X is xiᵀ.
Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then either starts from an initial random p-by-q weight matrix W or uses the weight matrix passed in the InitialTransformWeights name-value pair. q is the requested number of features that rica computes.
The weight matrix W is composed of columns wi of size p-by-1:

W = [w1 w2 … wq].
The algorithm attempts to minimize the Reconstruction ICA Objective Function by using a standard limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to IterationLimit iterations. It stops iterating when it takes a step whose norm is less than StepTolerance, or when it computes that the norm of the gradient at the current point is less than GradientTolerance times a scalar τ, where

τ = max(1, min(|f|, ‖g0‖∞)),

|f| is the norm of the objective function, and ‖g0‖∞ is the infinity norm of the initial gradient g0.
The objective function attempts to obtain a nearly orthonormal weight matrix that minimizes the sum of elements of g(XW), where g is a function (described below) that is applied elementwise to XW. To understand how the objective function attempts to achieve these goals, see Le, Karpenko, Ngiam, and Ng [3].
After constructing a ReconstructionICA object, use the transform method to map input data to the new output features.
The objective function uses a contrast function, which you specify by using the ContrastFcn name-value pair. The contrast function is a smooth convex function that is similar to an absolute value. By default, the contrast function is g(x) = (1/2) log(cosh(2x)). For other available contrast functions, see ContrastFcn.
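Assuming the default contrast function written above, this sketch compares it with the absolute value:

x = linspace(-3,3,201);
g = 0.5*log(cosh(2*x));       % default contrast function (assumed form)
plot(x,g,x,abs(x),'--')
legend('contrast function','|x|')
xlabel('x')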
For an n-by-p data matrix X and q output features, with a regularization parameter λ as the value of the Lambda name-value pair, the objective function in terms of the p-by-q matrix W is

h = (λ/n) Σi ‖W Wᵀ xi − xi‖₂² + (1/n) Σi Σj σj g(wjᵀ xi),

where the sums run over i = 1,…,n and j = 1,…,q.
The σj are known constants that are ±1. When σj = +1, minimizing the objective function h encourages the histogram of wjᵀ xi to be sharply peaked at 0 (super Gaussian). When σj = –1, minimizing the objective function h encourages the histogram of wjᵀ xi to be flatter near 0 (sub Gaussian). Specify the σj values using the rica NonGaussianityIndicator name-value pair.
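For example (the data and the choice of signs are purely illustrative):

X = randn(500,8);
q = 4;
sigma = [1 1 -1 -1];          % +1 encourages super Gaussian features, -1 sub Gaussian
Mdl = rica(X,q,'NonGaussianityIndicator',sigma);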
The objective function h can have a spurious minimum of zero when λ is zero. Therefore, rica minimizes h over matrices W whose columns are normalized to 1. In other words, each column wj of W is defined in terms of a column vector vj by

wj = vj / ‖vj‖₂,

and rica minimizes over the vj.
The resulting minimal matrix W provides the transformation from input data X to output features XW.
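As a sketch, you can check this normalization on the learned weights (assuming the columns of TransformWeights carry the normalized W):

X   = randn(300,10);
Mdl = rica(X,5);
colNorms = sqrt(sum(Mdl.TransformWeights.^2,1))   % each entry is expected to be close to 1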
[1] Ngiam, Jiquan, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, and Andrew Y. Ng. “Sparse Filtering.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1125–1133. https://papers.nips.cc/paper/4334-sparse-filtering.pdf.
[2] Nocedal, J. and S. J. Wright. Numerical Optimization, Second Edition. Springer Series in Operations Research, Springer Verlag, 2006.
[3] Le, Quoc V., Alexandre Karpenko, Jiquan Ngiam, and Andrew Y. Ng. “ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1017–1025. https://papers.nips.cc/paper/4467-ica-with-reconstruction-cost-for-efficient-overcomplete-feature-learning.pdf.
ReconstructionICA | rica | sparsefilt | SparseFiltering