Sequence-to-Sequence Classification Using 1-D Convolutions

This example shows how to classify each time step of sequence data using a generic temporal convolutional network (TCN).

While sequence-to-sequence tasks are commonly solved with recurrent neural network architectures, Bai et al. [1] show that convolutional neural networks can match, or even outperform, recurrent networks on typical sequence modeling tasks. Potential benefits of using convolutional networks include better parallelism, better control over the receptive field size, better control of the network's memory footprint during training, and more stable gradients. Just like recurrent networks, convolutional networks can operate on variable-length input sequences and can be used to model sequence-to-sequence or sequence-to-one tasks.

This example uses sensor data obtained from a smartphone worn on the body. It trains a temporal convolutional network, implemented as a function with a custom training loop and automatic differentiation, to recognize the activity of the wearer given time series data representing accelerometer readings in three different directions.

Load Training Data

Load the human activity recognition data. The data contains seven time series of sensor data obtained from a smartphone worn on the body. Each sequence has three features and varies in length. The three features correspond to the accelerometer readings in three different directions. Six sequences are used for training and one sequence is used for testing after training.

s = load("HumanActivityTrain.mat");

dsXTrain = arrayDatastore(s.XTrain,'OutputType','same');
dsYTrain = arrayDatastore(s.YTrain,'OutputType','same');

dsTrain = combine(dsXTrain,dsYTrain);

Determine the number of observations and the number of classes in the training data.

numObservations = numel(s.XTrain);
classes = categories(s.YTrain{1});
numClasses = numel(classes);
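
To confirm that the sequences vary in length, you can inspect the number of time steps in each training sequence. This quick check is not required for the workflow.

sequenceLengths = cellfun(@(X) size(X,2),s.XTrain)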

Visualize one training sequence in a plot. Plot the first feature of the first training sequence and color the plot according to the corresponding activity.

X = s.XTrain{1}(1,:);

figure
for j = 1:numel(classes)
    label = classes(j);
    idx = find(s.YTrain{1} == label);
    hold on
    plot(idx,X(idx))
end
hold off

xlabel("Time Step")
ylabel("Acceleration")
title("Training Sequence 1, Feature 1")
legend(classes,'Location','northwest')

Define Deep Learning Model

The main building block of a temporal convolutional network is a dilated causal convolution layer which operates over the time steps of each sequence. In this context, "causal" means that the activations computed for a particular time step cannot depend on activations from future time steps.
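
The following code is a minimal sketch, not part of the example workflow, that illustrates causality: it applies a single dilated convolution with left padding to a toy sequence and verifies that perturbing the last time step changes only the last output. All variable names here are illustrative.

filterSize = 3;
dilationFactor = 1;
paddingSize = (filterSize - 1) * dilationFactor;

X = dlarray(reshape(single(1:10),1,1,[]));          % 1 channel, 1 observation, 10 time steps
weights = dlarray(ones(filterSize,1,1,'single'));   % filterSize-by-numChannels-by-numFilters
bias = 0;

Y1 = dlconv(X,weights,bias,'DataFormat','CBS','DilationFactor',dilationFactor,'Padding',[paddingSize; 0]);

X(1,1,end) = 100;                                   % perturb only the final time step
Y2 = dlconv(X,weights,bias,'DataFormat','CBS','DilationFactor',dilationFactor,'Padding',[paddingSize; 0]);

find(squeeze(extractdata(Y1) ~= extractdata(Y2)))   % returns 10: only the last output changes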

In order to build up context from previous time steps, multiple convolutional layers are typically stacked on top of each other. To achieve large receptive field sizes, the dilation factor of subsequent convolution layers is increased exponentially. Assuming that the dilation factor of the k-th convolutional layer is 2^(k-1) and the stride is 1, the receptive field size of such a network can be computed as R = (f-1)(2^K-1)+1, where f is the filter size and K is the number of convolutional layers. Change the filter size and the number of layers to adjust the receptive field size and the number of learnable parameters as necessary for the data and task at hand.
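
As a quick sanity check of this formula, with filter size f = 3 and K = 4 convolutional layers whose dilation factors double at each layer, the receptive field spans 31 time steps:

f = 3;                   % filter size
K = 4;                   % number of stacked dilated convolutional layers
R = (f-1)*(2^K-1) + 1    % receptive field size: R = 31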

One disadvantage of TCNs compared to RNNs is their higher memory footprint during inference: the entire raw sequence is required to compute the next time step. To reduce inference time and memory consumption, especially for step-ahead predictions, it can be beneficial to train with the smallest sensible receptive field size R and perform prediction with only the last R time steps of the input sequence, as in the sketch below.
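
The following is a hedged sketch of such a step-ahead prediction. It assumes a trained parameters struct, the hyperparameters struct, the classes variable, and the model function defined later in this example, as well as an input dlarray dlX of size C-by-B-by-T. Note that the instance normalization statistics are computed over the truncated window, so the result can differ slightly from full-sequence inference.

R = 31;                                   % receptive field size, for example
dlXLast = dlX(:,:,max(1,end-R+1):end);    % keep only the most recent R time steps
dlYPred = model(dlXLast,parameters,hyperparameters,false);
dlYPred = softmax(dlYPred,'DataFormat','CBT');
YPredNewest = onehotdecode(dlYPred(:,:,end),classes,1)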

A general TCN architecture (as described in [1]) consists of multiple residual blocks, each containing two sets of dilated causal convolution layers with the same dilation factor, followed by normalization, ReLU activation, and spatial dropout layers. The input of each block is added to the output of the block (including a 1-by-1 convolution on the input when the numbers of input and output channels do not match), and a final activation function is applied.

Define and Initialize Model Parameters

Specify the parameters for the TCN architecture: four residual blocks containing dilated causal convolution layers, each with 175 filters of size 3.

numBlocks = 4;
numFilters = 175;
filterSize = 3;
dropoutFactor = 0.05;

To pass the model hyperparameters (the number of blocks and the dropout factor) to the model functions, create a struct containing these values.

hyperparameters = struct;
hyperparameters.NumBlocks = numBlocks;
hyperparameters.DropoutFactor = dropoutFactor;

Create a struct containing dlarray objects for all the learnable parameters of the model based on the number of input channels and the hyperparameters that define the model architecture. Each residual block requires weights and bias parameters for each of the two convolution operations. The first residual block also requires weights and biases for an additional convolution operation with filter size 1, because the number of input channels differs from the number of filters. The final fully connected operation requires weights and bias parameters as well. Initialize the learnable layer weights using the function initializeGaussian, listed at the end of the example. Initialize the learnable layer biases with zeros.

numInputChannels = 3;

parameters = struct;
numChannels = numInputChannels;

for k = 1:numBlocks
    parametersBlock = struct;
    blockName = "Block"+k;
    
    weights = initializeGaussian([filterSize, numChannels, numFilters]);
    bias = zeros(numFilters, 1, 'single');
    parametersBlock.Conv1.Weights = dlarray(weights);
    parametersBlock.Conv1.Bias = dlarray(bias);
    
    weights = initializeGaussian([filterSize, numFilters, numFilters]);
    bias = zeros(numFilters, 1, 'single');
    parametersBlock.Conv2.Weights = dlarray(weights);
    parametersBlock.Conv2.Bias = dlarray(bias);
    
    % If the input and output of the block have different numbers of
    % channels, then add a convolution with filter size 1.
    if numChannels ~= numFilters
        weights = initializeGaussian([1, numChannels, numFilters]);
        bias = zeros(numFilters, 1, 'single');
        parametersBlock.Conv3.Weights = dlarray(weights);
        parametersBlock.Conv3.Bias = dlarray(bias);
    end
    numChannels = numFilters;
    
    parameters.(blockName) = parametersBlock;
end

weights = initializeGaussian([numClasses,numChannels]);
bias = zeros(numClasses,1,'single');

parameters.FC.Weights = dlarray(weights);
parameters.FC.Bias = dlarray(bias);

View the network parameters.

parameters
parameters = struct with fields:
    Block1: [1×1 struct]
    Block2: [1×1 struct]
    Block3: [1×1 struct]
    Block4: [1×1 struct]
        FC: [1×1 struct]

View the parameters of the first block.

parameters.Block1
ans = struct with fields:
    Conv1: [1×1 struct]
    Conv2: [1×1 struct]
    Conv3: [1×1 struct]

View the parameters of the first convolutional operation of the first block.

parameters.Block1.Conv1
ans = struct with fields:
    Weights: [3×3×175 dlarray]
       Bias: [175×1 dlarray]

Define Model and Model Gradients Functions

Create the function model, listed in the Model Function section at the end of the example, that computes the outputs of the deep learning model. The function model takes the input data, the learnable model parameters, the model hyperparameters, and a flag that specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence.

Create the function modelGradients, listed in the Model Gradients Function section at the end of the example, that takes a mini-batch of input data, the corresponding target sequences, and the parameters of the network, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss.

Specify Training Options

Specify a set of training options used in the custom training loop.

  • Train for 30 epochs with mini-batch size 1.

  • Start with an initial learn rate of 0.001.

  • Multiply the learn rate by 0.1 every 12 epochs.

  • Clip the gradients using the L2 norm with threshold 1.

maxEpochs = 30;
miniBatchSize = 1;
initialLearnRate = 0.001;
learnRateDropFactor = 0.1;
learnRateDropPeriod = 12;
gradientThreshold = 1;

To monitor the training progress, you can plot the training loss after each iteration. Create the variable plots that contains "training-progress". If you do not want to plot the training progress, then set this value to "none".

plots = "training-progress";

Train Model

Train the network via stochastic gradient descent by looping over the sequences in the training dataset, computing parameter gradients, and updating the network parameters via the Adam update rule. This process repeats for multiple passes over the training data (referred to as epochs) until the maximum number of epochs is reached.

Use minibatchqueue to process and manage mini-batches of sequences during training. For each mini-batch:

  • Use the custom mini-batch preprocessing function preprocessMiniBatch (defined at the end of this example) to preprocess the sequence data, determine the number of time steps, and one-hot encode the class labels. The function has three outputs, so specify three output variables for the minibatchqueue.

  • Convert the data to unformatted dlarray objects with underlying type single.

  • Train on a GPU if one is available. By default, the minibatchqueue object converts each output to a gpuArray if a GPU is available. Using a GPU requires Parallel Computing Toolbox™ and a CUDA® enabled NVIDIA® GPU with compute capability 3.0 or higher.

mbq = minibatchqueue(dsTrain,3,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFcn', @preprocessMiniBatch);

For each epoch, shuffle the training data. For each mini-batch:

  • Evaluate the model gradients and loss using dlfeval and the modelGradients function.

  • Clip the gradients if they are too large, using the function thresholdL2Norm, listed at the end of the example, and the dlupdate function.

  • Update the network parameters using the adamupdate function.

  • Update the training progress plot.

After completing learnRateDropPeriod epochs, reduce the learn rate by multiplying the current learning rate by learnRateDropFactor.

Initialize the learning rate, which is multiplied by the LearnRateDropFactor value every LearnRateDropPeriod epochs.

learnRate = initialLearnRate;

Initialize the moving average of the parameter gradients and the element-wise squares of the gradients used by the Adam optimizer.

trailingAvg = [];
trailingAvgSq = [];

Initialize a plot showing the training progress.

if plots == "training-progress"
    figure
    lineLossTrain = animatedline('Color',[0.85 0.325 0.098]);
    ylim([0 inf])
    xlabel("Iteration")
    ylabel("Loss")
    grid on
end

Train the model.

iteration = 0;

start = tic;

% Loop over epochs.
for epoch = 1:maxEpochs
    
    % Shuffle the data.
    shuffle(mbq)
    
    % Loop over mini-batches.
    while hasdata(mbq)

        iteration = iteration + 1;
        
        [dlX,dlY,numTimeSteps] = next(mbq);
        
        % Evaluate the model gradients and loss using dlfeval.
        [gradients, loss] = dlfeval(@modelGradients,dlX,dlY,parameters,hyperparameters,numTimeSteps);
        
        % Clip the gradients.
        gradients = dlupdate(@(g) thresholdL2Norm(g,gradientThreshold),gradients);
        
        % Update the network parameters using the Adam optimizer.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg, trailingAvgSq, iteration, learnRate);
        
        if plots == "training-progress"
            % Plot training progress.
            D = duration(0,0,toc(start),'Format','hh:mm:ss');
            
            % Normalize the loss over the sequence lengths.
            loss = mean(loss ./ numTimeSteps);
            loss = double(gather(extractdata(loss)));

            addpoints(lineLossTrain,iteration,loss);

            title("Epoch: " + epoch + ", Elapsed: " + string(D))
            drawnow
        end
    end
    
    % Reduce the learning rate after learnRateDropPeriod epochs
    if mod(epoch,learnRateDropPeriod) == 0
        learnRate = learnRate*learnRateDropFactor;
    end
end

Test Model

Test the classification accuracy of the model by comparing the predictions on a held-out test set with the true labels for each time step. Manage the test data set using a minibatchqueue object with the same settings as the training data.

s = load("HumanActivityTest.mat");

dsXTest = arrayDatastore(s.XTest,'OutputType','same');
dsYTest = arrayDatastore(s.YTest,'OutputType','same');

dsTest = combine(dsXTest,dsYTest);

mbqTest = minibatchqueue(dsTest,3,...
    'MiniBatchSize',miniBatchSize,...
    'MiniBatchFcn', @preprocessMiniBatch);

To predict the labels of the test data, use the model function with the trained parameters, the hyperparameters, and the doTraining option set to false. Use the onehotdecode function to find the predicted class with the highest score and then compare the prediction with the true label. Evaluate the mean classification accuracy.

doTraining = false;

predictions = [];
predCorr = [];

while hasdata(mbqTest)
    [dlXTest,dlYTest] = next(mbqTest);
    
    dlYPred = model(dlXTest,parameters,hyperparameters,doTraining);
    dlYPred = softmax(dlYPred,'DataFormat','CBT');    
    
    YPred = onehotdecode(dlYPred,classes,1);
    YTest = onehotdecode(dlYTest,classes,1);
    
    predictions = [predictions YPred];
    predCorr = [predCorr YPred == YTest];   
    
end

mean(predCorr)
ans = 0.9996

Compare the predictions for a single sequence with the corresponding test data by using a plot.

figure
plot(squeeze(predictions),'.-')
hold on
plot(squeeze(YTest))
hold off

xlabel("Time Step")
ylabel("Activity")
title("Predicted Activities")
legend(["Predicted" "Test Data"])

Model Function

The function model takes the input data dlX, the learnable model parameters, the model hyperparameters, and the flag doTraining which specifies whether the model should return outputs for training or prediction. The network outputs the predictions for the labels at each time step of the input sequence. The model consists of multiple residual blocks with exponentially increasing dilation factors. After the last residual block, a final fullyconnect operation maps the output to the number of classes in the target data.

function dlY = model(dlX,parameters,hyperparameters,doTraining)

numBlocks = hyperparameters.NumBlocks;
dropoutFactor = hyperparameters.DropoutFactor;

dlY = dlX;

% Residual blocks.
for k = 1:numBlocks
    dilationFactor = 2^(k-1);
    parametersBlock = parameters.("Block"+k);
    
    dlY = residualBlock(dlY,dilationFactor,dropoutFactor,parametersBlock,doTraining);
end

% Fully connect
weights = parameters.FC.Weights;
bias = parameters.FC.Bias;
dlY = fullyconnect(dlY,weights,bias,'DataFormat','CBT');

end

Residual Block Function

The function residualBlock implements the core building block of the temporal convolutional network.

To apply 1-D causal dilated convolution, use the dlconv function:

  • To convolve over the time dimension, set the 'DataFormat' option to 'CBS' (use the dimension label 'S' instead of 'T', because dlconv convolves over dimensions labeled 'S').

  • Set the 'DilationFactor' option according to the dilation factor of the residual block.

  • To ensure only the past time steps are used, apply padding only at the beginning of the sequence.

function dlY = residualBlock(dlX,dilationFactor,dropoutFactor,parametersBlock,doTraining)

% Convolution options.
filterSize = size(parametersBlock.Conv1.Weights,1);
paddingSize = (filterSize - 1) * dilationFactor;

% Convolution.
weights = parametersBlock.Conv1.Weights;
bias =  parametersBlock.Conv1.Bias;
dlY = dlconv(dlX,weights,bias, ...
    'DataFormat','CBS', ...
    'Stride', 1, ...
    'DilationFactor', dilationFactor, ...
    'Padding', [paddingSize; 0] );

% Instance normalization, ReLU, spatial dropout.
dlY = instanceNormalization(dlY,'CBS');
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,'CBS',doTraining);

% Convolution.
weights = parametersBlock.Conv2.Weights;
bias = parametersBlock.Conv2.Bias;
dlY = dlconv(dlY,weights,bias, ...
    'DataFormat','CBS', ...
    'Stride', 1, ...
    'DilationFactor', dilationFactor, ...
    'Padding',[paddingSize; 0] );

% Instance normalization, ReLU, spatial dropout.
dlY = instanceNormalization(dlY,'CBS');
dlY = relu(dlY);
dlY = spatialDropout(dlY,dropoutFactor,'CBS',doTraining);

% Optional 1-by-1 convolution.
if ~isequal(size(dlX),size(dlY))
    weights = parametersBlock.Conv3.Weights;
    bias = parametersBlock.Conv3.Bias;
    dlX = dlconv(dlX,weights,bias,'DataFormat','CBS');
end

% Addition and ReLU
dlY = relu(dlX+dlY);

end

Model Gradients Function

The modelGradients function takes a mini-batch of input data dlX, the corresponding target sequences T, the learnable parameters, and the hyperparameters, and returns the gradients of the loss with respect to the learnable parameters and the corresponding loss. To compute the gradients, evaluate the modelGradients function using the dlfeval function in the training loop.

function [gradients,loss] = modelGradients(dlX,T,parameters,hyperparameters,numTimeSteps)

dlY = model(dlX,parameters,hyperparameters,true);
dlY = softmax(dlY,'DataFormat','CBT');

dlT = dlarray(T,'CBT');
loss = maskedCrossEntropyLoss(dlY, dlT, numTimeSteps);

gradients = dlgradient(mean(loss),parameters);

end

Masked Cross-Entropy Loss Function

The maskedCrossEntropyLoss function computes the cross-entropy loss for mini-batches of sequences of different lengths.

function loss = maskedCrossEntropyLoss(dlY,dlT,numTimeSteps)

numObservations = size(dlY,2);
loss = dlarray(zeros(1,numObservations,'like',dlY));

for i = 1:numObservations
    idx = 1:numTimeSteps(i);
    loss(i) = crossentropy(dlY(:,i,idx),dlT(:,i,idx),'DataFormat','CBT');
end

end

Instance Normalization Function

The instanceNormalization function normalizes the input dlX by first calculating the mean μ and the variance σ² for each observation over each input channel. Then it calculates the normalized activations as

X̂ = (X − μ) / √(σ² + ϵ).

In comparison to batch normalization, the mean and variance are different for each observation in the mini-batch. Use normalization, such as instance normalization, between convolutional layers and nonlinearities to speed up training of convolutional neural networks and improve convergence.

function dlY = instanceNormalization(dlX,fmt)

reductionDims = find(fmt == 'S');
mu = mean(dlX,reductionDims);
sigmaSq = var(dlX,1,reductionDims);

epsilon = 1e-5;
dlY = (dlX-mu) ./ sqrt(sigmaSq+epsilon);

end

Spatial Dropout Function

The spatialDropout function performs spatial dropout [3] on the input dlX with dimension format fmt when the doTraining flag is true. Spatial dropout drops an entire channel of the input data; that is, all time steps of a given channel are dropped with the probability specified by dropoutFactor. Channels are dropped independently for each observation in the batch.

function dlY = spatialDropout(dlX,dropoutFactor,fmt,doTraining)

if doTraining
    maskSize = size(dlX);
    maskSize(fmt=='S') = 1;
    
    dropoutScaleFactor = single(1 - dropoutFactor);
    dropoutMask = (rand(maskSize,'like',dlX) > dropoutFactor) / dropoutScaleFactor;
    
    dlY = dlX .* dropoutMask;
else
    dlY = dlX;
end

end

Weights Initialization Function

The initializeGaussian function samples weights from a Gaussian distribution with mean 0 and standard deviation 0.01.

function parameter = initializeGaussian(sz)

parameter = randn(sz,'single') .* 0.01;

end

Mini-Batch Preprocessing Function

The preprocessMiniBatch function preprocesses the data for training. The function transforms the N input sequences into a C-by-N-by-S numeric array of left-padded 1-D sequences, where C corresponds to the number of features of the sequences and S corresponds to the number of time steps of the longest sequence, and also returns the number of time steps in each sequence.

The preprocessMiniBatch function preprocesses the data using the following steps:

  1. Calculate the length of the longest sequence.

  2. One-hot encode the categorical labels of each time step into numeric arrays.

  3. Pad the sequences to the same length as the longest sequence in the mini-batch using the leftPad function.

function [XTransformed,YTransformed,numTimeSteps] = preprocessMiniBatch(XCell,YCell) 

    numTimeSteps = cellfun(@(sequence) size(sequence,2),XCell);
    sequenceLength = max(numTimeSteps);
    
    miniBatchSize = numel(XCell);
    numFeatures = size(XCell{1},1);    
    classes = categories(YCell{1});
    numClasses = numel(classes);
    
    szX = [numFeatures miniBatchSize sequenceLength];
    XTransformed = zeros(szX,'single');
    
    szY = [numClasses miniBatchSize sequenceLength];
    YTransformed = zeros(szY,'single');
    
    for i = 1:miniBatchSize
        predictors = XCell{i};
        
        responses = onehotencode(YCell{i},1);
        
        % Left pad.
        XTransformed(:,i,:) = leftPad(predictors,sequenceLength);
        YTransformed(:,i,:) = leftPad(responses,sequenceLength);
        
    end

end

Left Padding Function

The leftPad function takes a sequence and left-pads it with zeros to have the specified sequence length.

function sequencePadded = leftPad(sequence,sequenceLength)

[numFeatures,numTimeSteps] = size(sequence);

paddingSize = sequenceLength - numTimeSteps;
padding = zeros(numFeatures,paddingSize);

sequencePadded = [padding sequence];

end

Gradient Clipping Function

The thresholdL2Norm function scales the gradient g so that its L2 norm equals gradientThreshold when the L2 norm of the gradient is larger than gradientThreshold.

function g = thresholdL2Norm(g,gradientThreshold)

gradientNorm = sqrt(sum(g.^2,'all'));
if gradientNorm > gradientThreshold
    g = g * (gradientThreshold / gradientNorm);
end

end

References

[1] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).

[2] Van Den Oord, Aäron, et al. "WaveNet: A generative model for raw audio." SSW 125 (2016).

[3] Tompson, Jonathan, et al. "Efficient object localization using convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
