A softmax layer applies a softmax function to the input.
For classification problems, a softmax layer and then a classification layer must follow the final fully connected layer.
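The output unit activation function is the softmax function:

$$y_r(x) = \frac{\exp(a_r(x))}{\sum_{j=1}^{k} \exp(a_j(x))},$$

where $0 \le y_r \le 1$ and $\sum_{j=1}^{k} y_j = 1$.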
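As a minimal sketch of how the softmax function can be evaluated, the following Python snippet maps a list of activations to values that lie between 0 and 1 and sum to 1. The function name `softmax` and the sample activations are assumptions made for illustration. Subtracting the maximum activation before exponentiating is a common numerical safeguard that does not change the result.

```python
import math

def softmax(activations):
    """Return the softmax of a list of real-valued activations."""
    # Shifting by the maximum leaves the ratios unchanged
    # but keeps exp() from overflowing for large inputs.
    shift = max(activations)
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations for a three-class problem.
y = softmax([2.0, 1.0, 0.1])
```

Larger activations receive larger outputs, so the ordering of the inputs is preserved in `y`.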
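The softmax function is the output unit activation function after the last fully connected layer for multi-class classification problems:

$$P(c_r \mid x, \theta) = \frac{P(x, \theta \mid c_r)\, P(c_r)}{\sum_{j=1}^{k} P(x, \theta \mid c_j)\, P(c_j)} = \frac{\exp(a_r(x, \theta))}{\sum_{j=1}^{k} \exp(a_j(x, \theta))},$$

where $0 \le P(c_r \mid x, \theta) \le 1$ and $\sum_{j=1}^{k} P(c_j \mid x, \theta) = 1$. Moreover, $a_r = \ln(P(x, \theta \mid c_r)\, P(c_r))$, $P(x, \theta \mid c_r)$ is the conditional probability of the sample given class $r$, and $P(c_r)$ is the class prior probability.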
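To illustrate the link between the class posterior and the softmax function, the Python sketch below computes the posterior twice: once directly with Bayes' rule, and once by applying softmax to activations taken as the logarithm of the class-conditional probability times the class prior. The likelihood and prior values are made up for the example, and the variable names are hypothetical.

```python
import math

# Hypothetical values for a three-class problem (illustration only).
likelihoods = [0.20, 0.05, 0.10]  # conditional probability of the sample given each class
priors = [0.5, 0.3, 0.2]          # class prior probabilities, summing to 1

# Posterior via Bayes' rule: joint probability divided by the evidence.
joint = [l * p for l, p in zip(likelihoods, priors)]
evidence = sum(joint)
posterior_bayes = [j / evidence for j in joint]

# Same posterior via softmax of a_j = ln(likelihood_j * prior_j).
a = [math.log(j) for j in joint]
exp_a = [math.exp(v) for v in a]
total = sum(exp_a)
posterior_softmax = [e / total for e in exp_a]
```

Under these assumptions the two lists agree up to floating-point rounding, because exponentiating the logarithm recovers the joint probability.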
The softmax function is also known as the normalized exponential and can be considered the multi-class generalization of the logistic sigmoid function [1].
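The relationship to the logistic sigmoid can be checked numerically. With two classes, the softmax output for the first class equals the sigmoid of the difference between the two activations. The Python sketch below is one way to verify this. The helper names `sigmoid` and `softmax2` and the activation values are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid of a real number."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(a1, a2):
    """Softmax restricted to two activations."""
    shift = max(a1, a2)  # shift for numerical stability
    e1 = math.exp(a1 - shift)
    e2 = math.exp(a2 - shift)
    return e1 / (e1 + e2), e2 / (e1 + e2)

# Hypothetical activations for a two-class problem.
a1, a2 = 1.5, -0.5
y1, y2 = softmax2(a1, a2)
s = sigmoid(a1 - a2)
```

Here `y1` matches `s` and `y2` matches `1 - s` up to floating-point rounding.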