This example shows how to define a text encoder model function.
In the context of deep learning, an encoder is the part of a deep learning network that maps the input to some latent space. You can use the vectors in this latent space for various tasks. For example:
Classification, by applying a softmax operation to the encoded data and using cross-entropy loss (see the sketch after this list).
Sequence-to-sequence translation, by using the encoded vector as a context vector.
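For illustration only, the following minimal sketch shows how encoded feature vectors could be used for classification. It assumes a latent representation dlZ in 'CB' format, such as the one produced by the encoder defined later in this example, together with hypothetical classifier weights fcClassifier, bias bClassifier, and one-hot encoded targets T, none of which are part of this example.
% Hypothetical classification head on top of encoded vectors dlZ.
dlY = fullyconnect(dlZ,fcClassifier,bClassifier);  % Map latent vectors to class scores.
dlY = softmax(dlY);                                % Probabilities over the 'C' dimension.
loss = crossentropy(dlY,T);                        % Cross-entropy against one-hot targets T.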
The file sonnets.txt contains all of Shakespeare's sonnets in a single text file.
Read the Shakespeare's Sonnets data from the file "sonnets.txt".
filename = "sonnets.txt";
textData = fileread(filename);
The sonnets are indented by two whitespace characters. Remove the indentations using the replace function and split the text into separate lines using the split function. Remove the header (the first nine elements) and the short sonnet titles.
textData = replace(textData,"  ","");
textData = split(textData,newline);
textData(1:9) = [];
textData(strlength(textData)<5) = [];
Create a function that tokenizes and preprocesses the text data. The function preprocessText, listed at the end of the example, performs these steps:
Prepends and appends each input string with the specified start and stop tokens, respectively.
Tokenizes the text using the tokenizedDocument function.
Preprocess the text data and specify the start and stop tokens "<start>" and "<stop>", respectively.
startToken = "<start>"; stopToken = "<stop>"; documents = preprocessText(textData,startToken,stopToken);
Create a word encoding object from the tokenized documents.
enc = wordEncoding(documents);
When training a deep learning model, the input data must be a numeric array containing sequences of a fixed length. Because the documents have different lengths, you must pad the shorter sequences with a padding value.
Recreate the word encoding to also include a padding token and determine the index of that token.
paddingToken = "<pad>";
newVocabulary = [enc.Vocabulary paddingToken];
enc = wordEncoding(newVocabulary);
paddingIdx = word2ind(enc,paddingToken)
paddingIdx = 3595
The goal of the encoder is to map sequences of word indices to vectors in some latent space.
Initialize the parameters for the following model.
This model uses three operations:
The embedding maps word indices in the range 1 through vocabularySize to vectors of dimension embeddingDimension, where vocabularySize is the number of words in the encoding vocabulary and embeddingDimension is the number of components learned by the embedding.
The LSTM operation takes as input sequences of word vectors and outputs 1-by-numHiddenUnits vectors, where numHiddenUnits is the number of hidden units in the LSTM operation.
The fully connected operation multiplies the input by a weight matrix, adds a bias, and outputs vectors of size latentDimension, where latentDimension is the dimension of the latent space.
Specify the dimensions of the parameters.
embeddingDimension = 100;
numHiddenUnits = 150;
latentDimension = 50;
vocabularySize = enc.NumWords;
Create a struct for the parameters.
parameters = struct;
Initialize the weights of the embedding using Gaussian initialization with the initializeGaussian function, which is attached to this example as a supporting file. Specify a mean of 0 and a standard deviation of 0.01. To learn more, see Gaussian Initialization (Deep Learning Toolbox).
mu = 0;
sigma = 0.01;
parameters.emb.Weights = initializeGaussian([embeddingDimension vocabularySize],mu,sigma);
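The initializeGaussian supporting file is not listed in this example. A minimal sketch of such a function, consistent with the description above, might look like the following; the actual supporting file attached to the example may differ in detail.
function weights = initializeGaussian(sz,mu,sigma)
% Sample weights from a Gaussian distribution with mean mu and standard
% deviation sigma, and return them as a dlarray.
weights = dlarray(randn(sz,'single')*sigma + mu);
end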
Initialize the learnable parameters for the encoder LSTM operation:
Initialize the input weights with the Glorot initializer using the initializeGlorot function, which is attached to this example as a supporting file. To learn more, see Glorot Initialization (Deep Learning Toolbox).
Initialize the recurrent weights with the orthogonal initializer using the initializeOrthogonal function, which is attached to this example as a supporting file. To learn more, see Orthogonal Initialization (Deep Learning Toolbox).
Initialize the bias with the unit forget gate initializer using the initializeUnitForgetGate function, which is attached to this example as a supporting file. To learn more, see Unit Forget Gate Initialization (Deep Learning Toolbox).
The sizes of the learnable parameters depend on the size of the input. Because the inputs to the LSTM operation are sequences of word vectors from the embedding operation, the number of input channels is embeddingDimension.
The input weight matrix has size 4*numHiddenUnits-by-inputSize, where inputSize is the dimension of the input data.
The recurrent weight matrix has size 4*numHiddenUnits-by-numHiddenUnits.
The bias vector has size 4*numHiddenUnits-by-1.
sz = [4*numHiddenUnits embeddingDimension];
numOut = 4*numHiddenUnits;
numIn = embeddingDimension;

parameters.lstmEncoder.InputWeights = initializeGlorot(sz,numOut,numIn);
parameters.lstmEncoder.RecurrentWeights = initializeOrthogonal([4*numHiddenUnits numHiddenUnits]);
parameters.lstmEncoder.Bias = initializeUnitForgetGate(numHiddenUnits);
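The initializeGlorot, initializeOrthogonal, and initializeUnitForgetGate functions used above are supporting files that are not listed in this example. The sketches below show one possible implementation of each, consistent with the descriptions above; the actual supporting files attached to the example may differ in detail. The unit forget gate sketch assumes the input gate, forget gate, cell candidate, output gate ordering used by the lstm function.
function weights = initializeGlorot(sz,numOut,numIn)
% Glorot (Xavier) uniform initialization on the interval
% [-sqrt(6/(numIn+numOut)), sqrt(6/(numIn+numOut))].
Z = 2*rand(sz,'single') - 1;
bound = sqrt(6 / (numIn + numOut));
weights = dlarray(bound * Z);
end

function parameter = initializeOrthogonal(sz)
% Orthogonal initialization using the QR decomposition of a random matrix.
Z = randn(sz,'single');
[Q,R] = qr(Z,0);
D = diag(R);
Q = Q * diag(D ./ abs(D));
parameter = dlarray(Q);
end

function bias = initializeUnitForgetGate(numHiddenUnits)
% Zero bias, except for the forget gate section, which is set to one.
bias = zeros(4*numHiddenUnits,1,'single');
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;
bias = dlarray(bias);
end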
Initialize the learnable parameters for the encoder fully connected operation:
Initialize the weights with the Glorot initializer.
Initialize the bias with zeros using the initializeZeros function, which is attached to this example as a supporting file. To learn more, see Zeros Initialization (Deep Learning Toolbox).
The sizes of the learnable parameters depend on the size of the input. Because the inputs to the fully connected operation are the outputs of the LSTM operation, the number of input channels is numHiddenUnits. To make the fully connected operation output vectors with size latentDimension, specify an output size of latentDimension.
The weight matrix has size outputSize-by-inputSize, where outputSize and inputSize correspond to the output and input dimensions, respectively.
The bias vector has size outputSize-by-1.
sz = [latentDimension numHiddenUnits];
numOut = latentDimension;
numIn = numHiddenUnits;

parameters.fcEncoder.Weights = initializeGlorot(sz,numOut,numIn);
parameters.fcEncoder.Bias = initializeZeros([latentDimension 1]);
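The initializeZeros supporting file is likewise not listed here. A minimal sketch, consistent with the description above, might be:
function parameter = initializeZeros(sz)
% Return an array of zeros of the specified size as a dlarray.
parameter = dlarray(zeros(sz,'single'));
end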
Create the function modelEncoder, listed in the Encoder Model Function section of the example, which computes the output of the encoder model. The modelEncoder function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vectors.
To train the model using a custom training loop, you must iterate over mini-batches of data and convert each mini-batch into the format required for the encoder model and the model gradients functions. This section of the example illustrates the steps needed for preparing a mini-batch of data inside the custom training loop.
Prepare an example mini-batch of data. Select a mini-batch of 32 documents from documents. This represents the mini-batch of data used in an iteration of a custom training loop.
miniBatchSize = 32;
idx = 1:miniBatchSize;
documentsBatch = documents(idx);
Convert the documents to sequences using the doc2sequence function and specify to right-pad the sequences with the word index corresponding to the padding token.
X = doc2sequence(enc,documentsBatch, ...
    'PaddingDirection','right', ...
    'PaddingValue',paddingIdx);
The output of the doc2sequence function is a cell array, where each element is a row vector of word indices. Because the encoder model function requires numeric input, concatenate the rows of the data using the cat function and specify to concatenate along the first dimension. The output has size miniBatchSize-by-sequenceLength, where sequenceLength is the length of the longest sequence in the mini-batch.
X = cat(1,X{:});
size(X)
ans = 1×2
32 14
Convert the data to a dlarray with format 'BTC' (batch, time, channel). The software automatically rearranges the output to have format 'CBT', so the output has size 1-by-miniBatchSize-by-sequenceLength.
dlX = dlarray(X,'BTC');
size(dlX)
ans = 1×3
1 32 14
For masking, calculate the unpadded sequence lengths of the input data using the doclength function with the mini-batch of documents as input.
sequenceLengths = doclength(documentsBatch);
This code snippet shows an example of preparing a mini-batch in a custom training loop.
iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Read mini-batch.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        documentsBatch = documents(idx);

        % Convert to sequences.
        X = doc2sequence(enc,documentsBatch, ...
            'PaddingDirection','right', ...
            'PaddingValue',paddingIdx);
        X = cat(1,X{:});

        % Convert to dlarray.
        dlX = dlarray(X,'BTC');

        % Calculate sequence lengths.
        sequenceLengths = doclength(documentsBatch);

        % Evaluate model gradients.
        % ...

        % Update learnable parameters.
        % ...

    end
end
When training a deep learning model with a custom training loop, you must calculate the gradients of the loss with respect to the learnable parameters. This calculation depends on the output of a forward pass of the model function.
To perform a forward pass of the encoder, use the modelEncoder function directly with the parameters, data, and sequence lengths as input. The output is a latentDimension-by-miniBatchSize matrix.
dlZ = modelEncoder(parameters,dlX,sequenceLengths);
size(dlZ)
ans = 1×2
50 32
This code snippet shows an example of using a model encoder function inside the model gradients function.
function gradients = modelGradients(parameters,dlX,sequenceLengths)

dlZ = modelEncoder(parameters,dlX,sequenceLengths);

% Calculate loss.
% ...

% Calculate gradients.
% ...

end
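The loss itself depends on the task, so the snippet above leaves it as a placeholder. Purely for illustration, the sketch below shows one way the placeholders might be filled in, using an arbitrary loss (the mean squared latent activation) that is not part of this example; a real application uses a task-specific loss.
function gradients = modelGradients(parameters,dlX,sequenceLengths)

dlZ = modelEncoder(parameters,dlX,sequenceLengths);

% Calculate loss. This illustrative loss is arbitrary; replace it with a
% task-specific loss, for example cross-entropy against decoder targets.
loss = mean(dlZ.^2,'all');

% Calculate gradients of the loss with respect to the learnable parameters.
gradients = dlgradient(loss,parameters);

end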
This code snippet shows an example of evaluating the model gradients in a custom training loop.
iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Prepare mini-batch.
        % ...

        % Evaluate model gradients.
        gradients = dlfeval(@modelGradients, parameters, dlX, sequenceLengths);

        % Update learnable parameters.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg,trailingAvgSq,iteration);

    end
end
The modelEncoder function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vectors.
Because the input data contains padded sequences of different lengths, the padding can have adverse effects on loss calculations. For the LSTM operation, instead of returning the output of the last time step of the sequence (which likely corresponds to the LSTM state after processing lots of padding values), determine the actual last time step given by the sequenceLengths input.
function dlZ = modelEncoder(parameters,dlX,sequenceLengths)

% Embedding.
weights = parameters.emb.Weights;
dlZ = embed(dlX,weights);

% LSTM.
inputWeights = parameters.lstmEncoder.InputWeights;
recurrentWeights = parameters.lstmEncoder.RecurrentWeights;
bias = parameters.lstmEncoder.Bias;

numHiddenUnits = size(recurrentWeights,2);
hiddenState = zeros(numHiddenUnits,1,'like',dlX);
cellState = zeros(numHiddenUnits,1,'like',dlX);

dlZ1 = lstm(dlZ,hiddenState,cellState,inputWeights,recurrentWeights,bias);

% Output mode 'last' with masking.
miniBatchSize = size(dlZ1,2);
dlZ = zeros(numHiddenUnits,miniBatchSize,'like',dlZ1);
dlZ = dlarray(dlZ,'CB');

for n = 1:miniBatchSize
    t = sequenceLengths(n);
    dlZ(:,n) = dlZ1(:,n,t);
end

% Fully connect.
weights = parameters.fcEncoder.Weights;
bias = parameters.fcEncoder.Bias;
dlZ = fullyconnect(dlZ,weights,bias);

end
The function preprocessText performs these steps:
Prepends and appends each input string with the specified start and stop tokens, respectively.
Tokenizes the text using the tokenizedDocument function.
function documents = preprocessText(textData,startToken,stopToken)

% Add start and stop tokens.
textData = startToken + textData + stopToken;

% Tokenize the text.
documents = tokenizedDocument(textData,'CustomTokens',[startToken stopToken]);

end
doc2sequence | tokenizedDocument | word2ind | wordEncoding