When you train a network using layers, layer graphs, or dlnetwork objects, the software automatically initializes the learnable parameters according to the layer initialization properties. When you define a deep learning model as a function, you must initialize the learnable parameters manually. How you initialize the learnable parameters (for example, the weights and biases) can have a big impact on how quickly a deep learning model converges.
Tip

This page explains how to initialize learnable parameters for deep learning models defined as functions in a custom training loop. To learn how to specify the learnable parameter initialization for deep learning layers, use the corresponding layer properties. For example, to set the weights initializer of a convolution2dLayer object, use the WeightsInitializer property.
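For instance, a minimal sketch (the layer sizes here are arbitrary, not from the original) that overrides the default weights initializer of a convolutional layer:

```matlab
% Create a 2-D convolutional layer with 128 5-by-5 filters and select
% the He weights initializer instead of the default Glorot initializer.
layer = convolution2dLayer(5,128,'WeightsInitializer','he');
```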
This table shows the default initialization for the learnable parameters of each layer. The sections that follow show how to apply the same initializations to learnable parameters of model functions.
Layer | Learnable Parameter | Default Initialization |
---|---|---|
convolution2dLayer | Weights | Glorot Initialization |
convolution2dLayer | Bias | Zeros Initialization |
convolution3dLayer | Weights | Glorot Initialization |
convolution3dLayer | Bias | Zeros Initialization |
groupedConvolution2dLayer | Weights | Glorot Initialization |
groupedConvolution2dLayer | Bias | Zeros Initialization |
transposedConv2dLayer | Weights | Glorot Initialization |
transposedConv2dLayer | Bias | Zeros Initialization |
transposedConv3dLayer | Weights | Glorot Initialization |
transposedConv3dLayer | Bias | Zeros Initialization |
fullyConnectedLayer | Weights | Glorot Initialization |
fullyConnectedLayer | Bias | Zeros Initialization |
batchNormalizationLayer | Offset | Zeros Initialization |
batchNormalizationLayer | Scale | Ones Initialization |
lstmLayer | Input weights | Glorot Initialization |
lstmLayer | Recurrent weights | Orthogonal Initialization |
lstmLayer | Bias | Unit Forget Gate Initialization |
gruLayer | Input weights | Glorot Initialization |
gruLayer | Recurrent weights | Orthogonal Initialization |
gruLayer | Bias | Zeros Initialization |
wordEmbeddingLayer | Weights | Gaussian Initialization with mean 0 and standard deviation 0.01 |
When initializing learnable parameters for model functions, you must specify parameters of the correct size. The size of the learnable parameters depends on the type of deep learning operation, as this table shows (see the sketch after the table for an example).
Operation | Learnable Parameter | Size |
---|---|---|
batchnorm | Offset | numChannels-by-1 vector, where numChannels is the number of input channels |
batchnorm | Scale | numChannels-by-1 vector, where numChannels is the number of input channels |
dlconv | Weights | filterSize-by-numChannels-by-numFilters array, where filterSize gives the filter dimensions, numChannels is the number of input channels, and numFilters is the number of filters |
dlconv | Bias | One of the following: a numFilters-by-1 vector or a scalar |
dlconv (grouped) | Weights | filterSize-by-numChannelsPerGroup-by-numFiltersPerGroup-by-numGroups array, where numChannelsPerGroup and numFiltersPerGroup are the numbers of input channels and filters per group, and numGroups is the number of groups |
dlconv (grouped) | Bias | One of the following: a numFiltersPerGroup*numGroups-by-1 vector or a scalar |
dltranspconv | Weights | filterSize-by-numFilters-by-numChannels array |
dltranspconv | Bias | One of the following: a numFilters-by-1 vector or a scalar |
dltranspconv (grouped) | Weights | filterSize-by-numFiltersPerGroup-by-numChannelsPerGroup-by-numGroups array |
dltranspconv (grouped) | Bias | One of the following: a numFiltersPerGroup*numGroups-by-1 vector or a scalar |
fullyconnect | Weights | outputSize-by-inputSize matrix, where outputSize and inputSize are the numbers of output and input channels |
fullyconnect | Bias | outputSize-by-1 vector |
gru | Input weights | 3*numHiddenUnits-by-inputSize matrix, where numHiddenUnits is the number of hidden units of the operation |
gru | Recurrent weights | 3*numHiddenUnits-by-numHiddenUnits matrix |
gru | Bias | 3*numHiddenUnits-by-1 vector |
lstm | Input weights | 4*numHiddenUnits-by-inputSize matrix, where numHiddenUnits is the number of hidden units of the operation |
lstm | Recurrent weights | 4*numHiddenUnits-by-numHiddenUnits matrix |
lstm | Bias | 4*numHiddenUnits-by-1 vector |
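As an illustration, here is a minimal sketch (the channel counts are arbitrary, not from the original) that allocates correctly sized parameters for a fullyconnect operation. The sections below describe better ways to sample the weight values than zeros:

```matlab
% Hypothetical fullyconnect operation mapping 64 input channels to
% 10 output channels; the sizes follow the table above.
inputSize = 64;
outputSize = 10;

% Weights: outputSize-by-inputSize. Bias: outputSize-by-1.
parameters.fc.Weights = dlarray(zeros(outputSize,inputSize,'single'));
parameters.fc.Bias = dlarray(zeros(outputSize,1,'single'));
```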
The Glorot (also known as Xavier) initializer [1] samples weights from the uniform distribution with bounds $\left[-\sqrt{\frac{6}{N_i + N_o}},\ \sqrt{\frac{6}{N_i + N_o}}\right]$, where the values of $N_o$ and $N_i$ depend on the type of deep learning operation:
Operation | Learnable Parameter | $N_o$ | $N_i$ |
---|---|---|---|
dlconv | Weights | prod(filterSize)*numFilters | prod(filterSize)*numChannels |
dlconv (grouped) | Weights | prod(filterSize)*numFiltersPerGroup | prod(filterSize)*numChannelsPerGroup |
dltranspconv | Weights | prod(filterSize)*numFilters | prod(filterSize)*numChannels |
dltranspconv (grouped) | Weights | prod(filterSize)*numFiltersPerGroup | prod(filterSize)*numChannelsPerGroup |
fullyconnect | Weights | The number of output channels of the operation | The number of input channels of the operation |
gru | Input weights | 3*numHiddenUnits, where numHiddenUnits is the number of hidden units of the operation | The number of input channels of the operation |
gru | Recurrent weights | 3*numHiddenUnits | The number of hidden units of the operation |
lstm | Input weights | 4*numHiddenUnits, where numHiddenUnits is the number of hidden units of the operation | The number of input channels of the operation |
lstm | Recurrent weights | 4*numHiddenUnits | The number of hidden units of the operation |

Here, filterSize, numChannels, numFilters, numChannelsPerGroup, and numFiltersPerGroup are as defined in the size table above.
To initialize learnable parameters using the Glorot initializer easily, you can define a custom function. The function initializeGlorot takes as input the size of the learnable parameters sz and the values $N_o$ and $N_i$ (numOut and numIn, respectively), and returns the sampled weights as a dlarray object with underlying type 'single'.
```matlab
function weights = initializeGlorot(sz,numOut,numIn)

% Sample from the uniform distribution on [-bound, bound].
Z = 2*rand(sz,'single') - 1;
bound = sqrt(6 / (numIn + numOut));

weights = bound * Z;
weights = dlarray(weights);

end
```
Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.
```matlab
filterSize = [5 5];
numChannels = 3;
numFilters = 128;

sz = [filterSize numChannels numFilters];
numOut = prod(filterSize) * numFilters;
numIn = prod(filterSize) * numChannels;

parameters.conv.Weights = initializeGlorot(sz,numOut,numIn);
```
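The same function works for other operations. For example, a hedged sketch (with illustrative sizes, not from the original) that initializes the weights of a fullyconnect operation using the values of $N_o$ and $N_i$ from the table above:

```matlab
% Hypothetical fullyconnect operation with 64 input channels and
% 10 output channels. numOut is the number of output channels and
% numIn the number of input channels.
inputSize = 64;
outputSize = 10;

sz = [outputSize inputSize];
numOut = outputSize;
numIn = inputSize;

parameters.fc.Weights = initializeGlorot(sz,numOut,numIn);
```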
The He initializer [2] samples weights from the normal distribution with zero mean and variance $\frac{2}{N_i}$, where the value of $N_i$ depends on the type of deep learning operation:
Operation | Learnable Parameter | $N_i$ |
---|---|---|
dlconv | Weights | prod(filterSize)*numChannels, where filterSize gives the filter dimensions and numChannels is the number of input channels |
dltranspconv | Weights | prod(filterSize)*numChannels |
fullyconnect | Weights | The number of input channels of the operation |
gru | Input weights | The number of input channels of the operation |
gru | Recurrent weights | The number of hidden units of the operation |
lstm | Input weights | The number of input channels of the operation |
lstm | Recurrent weights | The number of hidden units of the operation |
To initialize learnable parameters using the He initializer easily, you can define a custom function. The function initializeHe takes as input the size of the learnable parameters sz and the value $N_i$ (numIn), and returns the sampled weights as a dlarray object with underlying type 'single'.
```matlab
function weights = initializeHe(sz,numIn)

% Sample from a normal distribution with zero mean and variance 2/numIn.
weights = randn(sz,'single') * sqrt(2/numIn);
weights = dlarray(weights);

end
```
Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.
```matlab
filterSize = [5 5];
numChannels = 3;
numFilters = 128;

sz = [filterSize numChannels numFilters];
numIn = prod(filterSize) * numChannels;

parameters.conv.Weights = initializeHe(sz,numIn);
```
The Gaussian initializer samples weights from a normal distribution with a specified mean and standard deviation.
To initialize learnable parameters using the Gaussian initializer easily, you can define a custom function. The function initializeGaussian takes as input the size of the learnable parameters sz, the distribution mean mu, and the distribution standard deviation sigma, and returns the sampled weights as a dlarray object with underlying type 'single'.
```matlab
function weights = initializeGaussian(sz,mu,sigma)

weights = randn(sz,'single')*sigma + mu;
weights = dlarray(weights);

end
```
Initialize the weights for an embedding operation with dimension 300 and vocabulary size 5000 using the Gaussian initializer with mean 0 and standard deviation 0.01.
```matlab
embeddingDimension = 300;
vocabularySize = 5000;
mu = 0;
sigma = 0.01;

sz = [embeddingDimension vocabularySize];

parameters.emb.Weights = initializeGaussian(sz,mu,sigma);
```
The uniform initializer samples weights from a uniform distribution centered on zero with a specified bound.
To initialize learnable parameters using the uniform initializer easily, you can define a custom function. The function initializeUniform takes as input the size of the learnable parameters sz and the distribution bound bound, and returns the sampled weights as a dlarray object with underlying type 'single'.
```matlab
function parameter = initializeUniform(sz,bound)

% Sample from the uniform distribution on [-bound, bound].
Z = 2*rand(sz,'single') - 1;

parameter = bound * Z;
parameter = dlarray(parameter);

end
```
Initialize the weights for an attention mechanism with size 100-by-100 and bound 0.1 using the uniform initializer.
```matlab
sz = [100 100];
bound = 0.1;

parameters.attention.Weights = initializeUniform(sz,bound);
```
The orthogonal initializer returns the orthogonal matrix Q given by the QR decomposition Z = QR, where Z is sampled from a unit normal distribution and the size of Z matches the size of the learnable parameter.
To initialize learnable parameters using the orthogonal initializer easily, you can define a custom function. The function initializeOrthogonal takes as input the size of the learnable parameters sz and returns the orthogonal matrix as a dlarray object with underlying type 'single'.
```matlab
function parameter = initializeOrthogonal(sz)

Z = randn(sz,'single');
[Q,R] = qr(Z,0);

% Adjust the signs of the columns so that R has a positive diagonal,
% which makes the factorization unique and avoids biasing the samples.
D = diag(R);
Q = Q * diag(D ./ abs(D));

parameter = dlarray(Q);

end
```
Initialize the recurrent weights for an LSTM operation with 100 hidden units using the orthogonal initializer.
```matlab
numHiddenUnits = 100;

sz = [4*numHiddenUnits numHiddenUnits];
parameters.lstm.RecurrentWeights = initializeOrthogonal(sz);
```
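As a quick sanity check (not part of the original example), you can confirm that the sampled matrix has orthonormal columns:

```matlab
% For a tall matrix Q with orthonormal columns, Q'*Q is the identity,
% so this value should be close to zero (up to single precision).
Q = extractdata(parameters.lstm.RecurrentWeights);
max(abs(Q'*Q - eye(numHiddenUnits)),[],'all')
```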
The unit forget gate initializer initializes the bias for an LSTM operation such that the forget gate components of the bias are ones and the remaining entries are zeros.
To initialize learnable parameters using the unit forget gate initializer easily, you can define a custom function. The function initializeUnitForgetGate takes as input the number of hidden units in the LSTM operation and returns the bias as a dlarray object with underlying type 'single'.
```matlab
function bias = initializeUnitForgetGate(numHiddenUnits)

bias = zeros(4*numHiddenUnits,1,'single');

% Set the forget gate entries (the second block of numHiddenUnits
% elements) to one.
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;

bias = dlarray(bias);

end
```
Initialize the bias of an LSTM operation with 100 hidden units using the unit forget gate initializer.
```matlab
numHiddenUnits = 100;
parameters.lstm.Bias = initializeUnitForgetGate(numHiddenUnits);
```
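The bias stacks the four LSTM gates in the order input, forget, cell candidate, and output, which is why initializeUnitForgetGate writes ones into the second block of numHiddenUnits entries. A quick way to see the layout (assuming the variables defined above):

```matlab
% Each column corresponds to one gate: input, forget, cell candidate,
% and output. Column 2 (the forget gate) is all ones.
gates = reshape(extractdata(parameters.lstm.Bias),numHiddenUnits,4);
```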
To initialize learnable parameters with ones easily, you can define a custom function. The function initializeOnes takes as input the size of the learnable parameters sz and returns the parameters as a dlarray object with underlying type 'single'.
```matlab
function parameter = initializeOnes(sz)

parameter = ones(sz,'single');
parameter = dlarray(parameter);

end
```
Initialize the scale for a batch normalization operation with 128 input channels using the ones initializer.

```matlab
numChannels = 128;

sz = [numChannels 1];
parameters.bn.Scale = initializeOnes(sz);
```
To initialize learnable parameters with zeros easily, you can define a custom function. The function initializeZeros takes as input the size of the learnable parameters sz and returns the parameters as a dlarray object with underlying type 'single'.
```matlab
function parameter = initializeZeros(sz)

parameter = zeros(sz,'single');
parameter = dlarray(parameter);

end
```
Initialize the offset for a batch normalization operation with 128 input channels using the zeros initializer.

```matlab
numChannels = 128;

sz = [numChannels 1];
parameters.bn.Offset = initializeZeros(sz);
```
[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249-256. 2010.

[2] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." In Proceedings of the IEEE International Conference on Computer Vision, 1026-1034. 2015.
dlarray | dlfeval | dlgradient | dlnetwork