If Deep Learning Toolbox™ does not provide the layer you require for your classification or regression problem, then you can define your own custom layer. For a list of built-in layers, see List of Deep Learning Layers.
The example Define Custom Deep Learning Layer with Learnable Parameters shows how to create a custom PReLU layer and goes through the following steps:
Name the layer – give the layer a name so that it can be used in MATLAB®.
Declare the layer properties – specify the properties of the layer and which parameters are learned during training.
Create a constructor function (optional) – specify how to construct the layer and initialize its properties. If you do not specify a constructor function, then at creation, the software initializes the Name, Description, and Type properties with [] and sets the number of layer inputs and outputs to 1.
Create forward functions – specify how data passes forward through the layer (forward propagation) at prediction time and at training time.
Create a backward function (optional) – specify the derivatives of the loss with respect to the input data and the learnable parameters (backward propagation). If you do not specify a backward function, then the forward functions must support dlarray objects.
If the forward function only uses functions that support dlarray objects, then creating a backward function is optional. In this case, the software determines the derivatives automatically using automatic differentiation. For a list of functions that support dlarray objects, see List of Functions with dlarray Support. If you want to use functions that do not support dlarray objects, or want to use a specific algorithm for the backward function, then you can define a custom backward function using this example as a guide.
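For example, you can check the PReLU-style operation used later in this topic against dlarray input directly. This is a minimal sketch (not part of the example), with illustrative input sizes and coefficient values; because max, min, and element-wise multiplication all support dlarray objects, a layer built from them does not need a custom backward function:

% Minimal sketch: confirm the operations used in a forward function run on
% dlarray input, so automatic differentiation can supply the backward pass.
% Sizes and coefficient values are illustrative assumptions.
X = dlarray(randn(4,4,3,2));          % sample image-style input (H-by-W-by-C-by-N)
alpha = 0.01*ones(1,1,3);             % hypothetical per-channel coefficients
Z = max(X,0) + alpha .* min(X,0);     % PReLU-style operation on dlarray data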
The example Define Custom Deep Learning Layer with Learnable Parameters shows how to create a PReLU layer. A PReLU layer performs a threshold operation: for each channel, any input value less than zero is multiplied by a scalar coefficient learned at training time [1]. These per-channel scaling coefficients form a learnable parameter, which the layer learns during training.
The PReLU operation is given by

$$f(x_i) = \begin{cases} x_i & \text{if } x_i > 0 \\ \alpha_i x_i & \text{if } x_i \le 0 \end{cases}$$

where $x_i$ is the input of the nonlinear activation $f$ on channel $i$, and $\alpha_i$ is the coefficient controlling the slope of the negative part. The subscript $i$ in $\alpha_i$ indicates that the nonlinear activation can vary on different channels.
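As a small numeric illustration (the values are hypothetical, not taken from the example), applying the operation to one channel with a slope coefficient of 0.25 scales only the negative inputs:

% Hypothetical values for one channel.
x = [-2 -1 0 1 2];
alpha = 0.25;                       % assumed slope coefficient for this channel
f = max(x,0) + alpha*min(x,0)       % returns [-0.5 -0.25 0 1 2]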
View the layer created in the example Define Custom Deep Learning Layer with Learnable Parameters. This layer does not have a backward function.
classdef preluLayer < nnet.layer.Layer
    % Example custom PReLU layer.

    properties (Learnable)
        % Layer learnable parameters

        % Scaling coefficient
        Alpha
    end

    methods
        function layer = preluLayer(numChannels, name)
            % layer = preluLayer(numChannels, name) creates a PReLU layer
            % for 2-D image input with numChannels channels and specifies
            % the layer name.

            % Set layer name.
            layer.Name = name;

            % Set layer description.
            layer.Description = "PReLU with " + numChannels + " channels";

            % Initialize scaling coefficient.
            layer.Alpha = rand([1 1 numChannels]);
        end

        function Z = predict(layer, X)
            % Z = predict(layer, X) forwards the input data X through the
            % layer and outputs the result Z.

            Z = max(X,0) + layer.Alpha .* min(0,X);
        end
    end
end
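Before using the layer in a network, you can check its validity with the checkLayer function. The sizes in this sketch are illustrative assumptions (20 channels, 24-by-24 images, observations in the fourth dimension), not values taken from this page:

% Sketch of a validity check for the custom layer; sizes are assumptions.
layer = preluLayer(20,"prelu");
validInputSize = [24 24 20];
checkLayer(layer,validInputSize,'ObservationDimension',4)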
Implement the backward function that returns the derivatives of the loss with respect to the input data and the learnable parameters. The syntax for backward is
[dLdX1,…,dLdXn,dLdW1,…,dLdWk] = backward(layer,X1,…,Xn,Z1,…,Zm,dLdZ1,…,dLdZm,memory)
where X1,…,Xn are the n layer inputs, Z1,…,Zm are the m outputs of the layer forward functions, dLdZ1,…,dLdZm are the gradients backward propagated from the next layer, and memory is the memory output of forward if forward is defined; otherwise, memory is [].
For the outputs, dLdX1,…,dLdXn are the derivatives of the loss with respect to the layer inputs and dLdW1,…,dLdWk are the derivatives of the loss with respect to the k learnable parameters. To reduce memory usage by preventing unused variables being saved between the forward and backward pass, replace the corresponding input arguments with ~.
Tip
If the number of inputs to backward can vary, then use varargin instead of the input arguments after layer. In this case, varargin is a cell array of the inputs, where varargin{i} corresponds to Xi for i = 1,…,NumInputs, varargin{NumInputs+j} and varargin{NumInputs+NumOutputs+j} correspond to Zj and dLdZj, respectively, for j = 1,…,NumOutputs, and varargin{end} corresponds to memory.
If the number of outputs can vary, then use varargout instead of the output arguments. In this case, varargout is a cell array of the outputs, where varargout{i} corresponds to dLdXi for i = 1,…,NumInputs and varargout{NumInputs+t} corresponds to dLdWt for t = 1,…,k, where k is the number of learnable parameters.
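For illustration, the following is a hedged sketch of a varargin/varargout backward function for a hypothetical weighted-addition layer with a variable number of inputs, one output Z = sum_i Weights(i)*Xi, and a learnable vector Weights. The layer, its names, and its derivatives are assumptions used only to show how the arguments are unpacked; this is not part of the PReLU example.

function varargout = backward(layer,varargin)
    % Hypothetical layer: variable number of inputs, one output, learnable
    % vector Weights with one element per input. Arguments are unpacked by
    % position as described in the tip above.
    numInputs = layer.NumInputs;
    X    = varargin(1:numInputs);      % layer inputs X1,...,Xn (cell array)
    dLdZ = varargin{numInputs+2};      % gradient of the single output
                                       % (varargin{numInputs+1} is Z,
                                       %  varargin{end} is memory)

    varargout = cell(1,numInputs+1);
    dLdW = zeros(size(layer.Weights),'like',dLdZ);
    for i = 1:numInputs
        varargout{i} = layer.Weights(i) .* dLdZ;    % dLdXi
        dLdW(i) = sum(X{i} .* dLdZ,'all');          % dLdWi
    end
    varargout{numInputs+1} = dLdW;                  % derivative for Weights
end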
Because a PReLU layer has only one input, one output, one learnable parameter, and does not require the outputs of the layer forward function or a memory value, the syntax for backward for a PReLU layer is [dLdX,dLdAlpha] = backward(layer,X,~,dLdZ,~). The dimensions of X are the same as in the forward function. The dimensions of dLdZ are the same as the dimensions of the output Z of the forward function. The dimensions and data type of dLdX are the same as the dimensions and data type of X. The dimensions and data type of dLdAlpha are the same as the dimensions and data type of the learnable parameter Alpha.
During the backward pass, the layer automatically updates the learnable parameters using the corresponding derivatives.
To include a custom layer in a network, the layer forward functions must accept the outputs of the previous layer and forward propagate arrays with the size expected by the next layer. Similarly, when backward is specified, the backward function must accept inputs with the same size as the corresponding output of the forward function and backward propagate derivatives with the same size.
The derivative of the loss with respect to the input data is

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial f(x_i)}\,\frac{\partial f(x_i)}{\partial x_i}$$

where $\partial L / \partial f(x_i)$ is the gradient propagated from the next layer, and the derivative of the activation is

$$\frac{\partial f(x_i)}{\partial x_i} = \begin{cases} 1 & \text{if } x_i \ge 0 \\ \alpha_i & \text{if } x_i < 0 \end{cases}$$

The derivative of the loss with respect to the learnable parameters is

$$\frac{\partial L}{\partial \alpha_i} = \sum_j \frac{\partial L}{\partial f(x_{ij})}\,\frac{\partial f(x_{ij})}{\partial \alpha_i}$$

where $i$ indexes the channels, $j$ indexes the elements over height, width, and observations, and the gradient of the activation is

$$\frac{\partial f(x_i)}{\partial \alpha_i} = \begin{cases} 0 & \text{if } x_i \ge 0 \\ x_i & \text{if } x_i < 0 \end{cases}$$
Create the backward function that returns these derivatives.
function [dLdX, dLdAlpha] = backward(layer, X, ~, dLdZ, ~)
% [dLdX, dLdAlpha] = backward(layer, X, ~, dLdZ, ~)
% backward propagates the derivative of the loss function
% through the layer.
% Inputs:
% layer - Layer to backward propagate through
% X - Input data
% dLdZ - Gradient propagated from the deeper layer
% Outputs:
% dLdX - Derivative of the loss with respect to the
% input data
% dLdAlpha - Derivative of the loss with respect to the
% learnable parameter Alpha
dLdX = layer.Alpha .* dLdZ;
dLdX(X>0) = dLdZ(X>0);
dLdAlpha = min(0,X) .* dLdZ;
dLdAlpha = sum(dLdAlpha,[1 2]);
% Sum over all observations in mini-batch.
dLdAlpha = sum(dLdAlpha,4);
end
View the completed layer class file.
classdef preluLayer < nnet.layer.Layer
    % Example custom PReLU layer.

    properties (Learnable)
        % Layer learnable parameters

        % Scaling coefficient
        Alpha
    end

    methods
        function layer = preluLayer(numChannels, name)
            % layer = preluLayer(numChannels, name) creates a PReLU layer
            % for 2-D image input with numChannels channels and specifies
            % the layer name.

            % Set layer name.
            layer.Name = name;

            % Set layer description.
            layer.Description = "PReLU with " + numChannels + " channels";

            % Initialize scaling coefficient.
            layer.Alpha = rand([1 1 numChannels]);
        end

        function Z = predict(layer, X)
            % Z = predict(layer, X) forwards the input data X through the
            % layer and outputs the result Z.

            Z = max(X,0) + layer.Alpha .* min(0,X);
        end

        function [dLdX, dLdAlpha] = backward(layer, X, ~, dLdZ, ~)
            % [dLdX, dLdAlpha] = backward(layer, X, ~, dLdZ, ~)
            % backward propagates the derivative of the loss function
            % through the layer.
            %
            % Inputs:
            %     layer - Layer to backward propagate through
            %     X     - Input data
            %     dLdZ  - Gradient propagated from the deeper layer
            % Outputs:
            %     dLdX     - Derivative of the loss with respect to the
            %                input data
            %     dLdAlpha - Derivative of the loss with respect to the
            %                learnable parameter Alpha

            dLdX = layer.Alpha .* dLdZ;
            dLdX(X>0) = dLdZ(X>0);
            dLdAlpha = min(0,X) .* dLdZ;
            dLdAlpha = sum(dLdAlpha,[1 2]);

            % Sum over all observations in mini-batch.
            dLdAlpha = sum(dLdAlpha,4);
        end
    end
end
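The custom layer can then be included in a layer array like any built-in layer. The surrounding layers and sizes in this sketch are illustrative assumptions for a small image classification network, not part of the example:

% Sketch: include the custom PReLU layer in a simple network (sizes assumed).
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(5,20)
    batchNormalizationLayer
    preluLayer(20,"prelu")
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer];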
If the layer forward functions fully support dlarray objects, then the layer is GPU compatible. Otherwise, to be GPU compatible, the layer functions must support inputs and return outputs of type gpuArray (Parallel Computing Toolbox).
Many MATLAB built-in functions support gpuArray (Parallel Computing Toolbox) and dlarray input arguments. For a list of functions that support dlarray objects, see List of Functions with dlarray Support. For a list of functions that execute on a GPU, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox). To use a GPU for deep learning, you must also have a CUDA® enabled NVIDIA® GPU with compute capability 3.0 or higher. For more information on working with GPUs in MATLAB, see GPU Computing in MATLAB (Parallel Computing Toolbox).
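As a quick, hedged check (assuming Parallel Computing Toolbox and a supported GPU are available), you can pass gpuArray input through the layer's forward function and confirm that the output is also a gpuArray:

% Illustrative GPU compatibility check; requires a supported GPU.
layer = preluLayer(3,"prelu");
X = gpuArray(rand(8,8,3,2,'single'));
Z = predict(layer,X);
class(Z)    % returns 'gpuArray' when the operations run on the GPU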
[1] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." In Proceedings of the IEEE international conference on computer vision, pp. 1026-1034. 2015.