Mastering the Softmax Function: Understanding its Derivative with a Step-by-Step Example
This article focuses on obtaining the derivative
of the softmax function by means of a simple example. It assumes that the reader is familiar with standard high-school single-variable calculus.
The challenge in computing the derivative of the softmax function arises from the requisite understanding of multivariable calculus.
Therefore, after familiarizing ourselves with the function itself, we will look at the necessary mathematical concepts and terms needed for computing its derivative.
Finally, we apply this knowledge in the third part to derive a concise, general description of the softmax function’s derivative using a concrete example.
0.1. The softmax function $\sigma$
From wikipedia:
The softmax function is a function that maps a real-valued vector $\boldsymbol{z}$
of length $K$
to a vector whose values sum to one:
\begin{equation}
\sigma(\boldsymbol{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}
\end{equation}
for $i = 1, \ldots, K$ and
$\boldsymbol{z} = (z_1, \ldots, z_K) \in \mathbb{R}^K$.
Thus $\sigma: \mathbb{R}^K \to (0,1)^K$.
The current definition seems a bit complex, so let’s simplify it with an illustrative example.
Consider the softmax function $\sigma(\boldsymbol{z})$, where $\boldsymbol{z}$ is a real-valued vector of length $K$.
For instance, with $K=3$, $\boldsymbol{z} = \begin{bmatrix} 1\\2\\3\end{bmatrix}$. When applying the softmax function, each element $y_i$ of the output vector $\boldsymbol{y}$
is calculated as $\frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$. In our example, this translates to
\begin{equation}
\sigma(\boldsymbol{z})= \sigma\left(\begin{bmatrix} z_1\\z_2\\z_3\end{bmatrix}\right) =\sigma\left(\begin{bmatrix} 1\\2\\3\end{bmatrix}\right) = \boldsymbol{y}=
\begin{bmatrix}
y_1\\
y_2\\
y_3\\
\end{bmatrix}
=
\begin{bmatrix}
\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\\
\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}\\
\frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{e^{1}}{e^{1}+e^{2}+e^{3}}\\
\frac{e^{2}}{e^{1}+e^{2}+e^{3}}\\
\frac{e^{3}}{e^{1}+e^{2}+e^{3}}
\end{bmatrix}
\end{equation}
Essentially, the softmax function transforms an input vector of length $K$ into an output vector of the same length, with each output element given by the exponential of its corresponding input element divided by the sum of the exponentials of all input elements. The role of the exponential function is to map any real value to a positive number. The division ensures that the elements of the output vector, each of which lies in the open interval $(0,1)$, always sum to one, aligning with the concept of probabilities. This property makes the softmax function particularly useful
when dealing with probability distributions.
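To make this concrete, here is a minimal NumPy sketch of the softmax applied to our example vector $\boldsymbol{z} = (1, 2, 3)$ (the use of NumPy and the function name `softmax` are simply illustrative choices):

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector z to a vector of positive values that sum to one."""
    exp_z = np.exp(z)           # exponentiate: every entry becomes positive
    return exp_z / exp_z.sum()  # normalize: the entries now sum to one

z = np.array([1.0, 2.0, 3.0])
y = softmax(z)
print(y)        # approx. [0.09 0.245 0.665]
print(y.sum())  # 1.0 (up to floating-point rounding)
```

In practice one often subtracts `z.max()` from `z` before exponentiating for numerical stability, but that detail is not needed for the derivation that follows.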
Now that we have an idea of what the softmax function does, let’s look at how to find its derivative.
0.2. Partial Derivatives, the Differential Operator, and the Jacobian
If you are unfamiliar with multivariable (or multivariate) calculus, please continue reading to understand how the derivative of the softmax function is derived. This section aims to provide a
simplified explanation, rather than a rigorous mathematical introduction to the topic. In single-variable calculus, we typically deal with functions that take one variable as input and map it
to some output. However, multivariate functions, such as $f(x, y) = x^2 + 3y$, involve multiple input variables. To compute the derivative of such functions, we need to determine all their
\textbf{partial derivatives}. A partial derivative with respect to a certain variable is obtained by treating all other variables as constants. For example, $\partial f/ \partial x$
represents the partial derivative of $f(x, y)$ with respect to $x$, treating $y$ as a constant.
Here are all partial derivatives for the example function $f(x, y) = x^2 + 3y $:
\begin{equation}
\begin{aligned}
\frac{\partial f}{\partial x} &= \frac{\partial}{\partial x}(x^2+3y) = 2x \\
\frac{\partial f}{\partial y} &= \frac{\partial}{\partial y}(x^2+3y) = 3
\end{aligned}
\end{equation}
In the upper equation we differentiated the function w.r.t. (with respect to) $x$, which is written as
$\frac{\partial f}{\partial x}$ and read as “partial f partial x”; here, we treated $y$ as a constant. In the lower equation we differentiated the function w.r.t. $y$ and therefore treated $x$ as a constant.
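As a quick sanity check, these partial derivatives can be approximated numerically with finite differences; here is a minimal sketch (the evaluation point $(2, 5)$ and the step size $h$ are arbitrary choices for illustration):

```python
# Finite-difference check of the partial derivatives of f(x, y) = x^2 + 3y.
def f(x, y):
    return x**2 + 3*y

x, y, h = 2.0, 5.0, 1e-6
df_dx = (f(x + h, y) - f(x, y)) / h  # should be close to 2*x = 4
df_dy = (f(x, y + h) - f(x, y)) / h  # should be close to 3
print(df_dx, df_dy)                  # approx. 4.0 and 3.0
```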
The output of a multivariate function can be a scalar or a vector. When a multivariate function maps to a scalar value, it’s referred to as a scalar-valued multivariate function.
The \textbf{differential operator} $\nabla$ (pronounced “del”) is commonly used in vector calculus and can be applied to such scalar-valued functions.
It indicates the direction of steepest ascent of the function by collecting all of its partial derivatives into a vector, known as the gradient.
Here is an example for the above function:
\begin{equation}
\begin{aligned}
\nabla(f) =
\begin{bmatrix}
\frac{\partial f}{\partial x} \\
\frac{\partial f}{\partial y}
\end{bmatrix}
=
\begin{bmatrix}
2x \\
3
\end{bmatrix}
\end{aligned}
\end{equation}
Sometimes the gradient is also expressed as a row vector.
We don’t really need the differential operator in this article; I merely included it because it’s sometimes used
in machine learning literature to write the Jacobian in a more compact way.
If the output of a multivariate function is a vector, we call it a multivariate vector-valued function.
For example
$f(\mathbf{x})=\mathbf{y}$ with
$\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{y} \in \mathbb{R}^m $ is a multivariate vector-valued function.
The generalization of the derivative for such a function is called the \textbf{Jacobian}. It comprises all $n$ partial derivatives of each of the $m$ elements of the output vector.
Therefore, it is represented as a matrix $\mathbf{J}_f \in \mathbb{R}^{m \times n}$, where the subscript $f$ indicates the function for which we compute the Jacobian.
The Jacobian matrix provides valuable information about how each element of the output vector changes with respect to each input variable, making it
essential for understanding the behavior of multivariate vector-valued functions:
\begin{equation}
\mathbf{J}_f=
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \ldots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \ldots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \ldots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
\end{equation}
Note that we can also write this as
a vector of gradients (in which case each gradient is interpreted as a row vector).
\begin{equation}
\mathbf{J}_f=
\begin{bmatrix}
\nabla y_1
\\
\nabla y_2\\
\vdots\\
\nabla y_m\\
\end{bmatrix}
\end{equation}
Let’s look at a concrete example to familiarize ourselves with the concept of the Jacobian. Take a function that maps two input variables, $x_1$ and $x_2$, to three output variables, $y_1$, $y_2$, and $y_3$, thus
$f: \mathbb{R}^{2} \to\mathbb{R}^{3}$:
\begin{equation}
\begin{aligned}
f\left(\begin{bmatrix}x_1\\ x_2\end{bmatrix}\right) =
\begin{bmatrix}
y_1\\
y_2\\
y_3
\end{bmatrix}
=
\begin{bmatrix}e^{x_1}+e^{x_2}\\
e^{x_1}x_2\\
4x_1+3{x_2}^3
\end{bmatrix}
\end{aligned}
\end{equation}
The Jacobian comprises the partial derivatives with respect to each of the two input variables $x_1$ and $x_2$ for each of the three output variables, thus:
\begin{equation}
\mathbf{J}_f=
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \\
\frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2}
\end{bmatrix}
=
\begin{bmatrix}
e^{x_1} & e^{x_2} \\
e^{x_1}x_2 & e^{x_1} \\
4 & 9{x_2}^2
\end{bmatrix}
\end{equation}
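If you want to convince yourself of this result, here is a minimal NumPy sketch that codes up the example function, the analytic Jacobian from the equation above, and a finite-difference approximation at an arbitrarily chosen point with an arbitrarily chosen step size:

```python
import numpy as np

def f(x):
    """The example function mapping (x1, x2) to (y1, y2, y3)."""
    x1, x2 = x
    return np.array([np.exp(x1) + np.exp(x2),
                     np.exp(x1) * x2,
                     4*x1 + 3*x2**3])

def jacobian_f(x):
    """Analytic Jacobian from the equation above: 3 rows (outputs) x 2 columns (inputs)."""
    x1, x2 = x
    return np.array([[np.exp(x1),      np.exp(x2)],
                     [np.exp(x1) * x2, np.exp(x1)],
                     [4.0,             9 * x2**2]])

# Compare against a forward finite-difference approximation at an arbitrary point.
x, h = np.array([0.5, -1.0]), 1e-6
J_numeric = np.column_stack([(f(x + h*e) - f(x)) / h for e in np.eye(2)])
print(np.allclose(jacobian_f(x), J_numeric, atol=1e-4))  # True
```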
The take-away message from this paragraph is that for a multivariate vector-valued function $f: \mathbb{R}^{n} \to\mathbb{R}^{m}$, the equivalent of its derivative is its Jacobian $\mathbf{J}_f \in \mathbb{R}^{m\times n}$.
This Jacobian comprises
the partial derivatives of each of the $m$ output variables with respect to each of the $n$ input variables.
With this understanding, we are now prepared to compute the Jacobian for the softmax function.
0.3. The Derivative of the Softmax Function
The softmax function $\sigma$ is multivariate, because its input is a vector. In addition, it’s also vector-valued, because its output is a vector. We know (from the preceding paragraph) that the derivative of such a function
is given by its Jacobian.
Since, for the softmax, both the input and the output are of length $K$, its Jacobian is a $K \times K$ matrix, $\mathbf{J}_{\sigma} \in \mathbb{R}^{K\times K}$:
\begin{equation}
\mathbf{J}_{\sigma} =
\begin{bmatrix}
\nabla y_1 \\
\nabla y_2\\
\vdots\\
\nabla y_K
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial y_1}{\partial z_1} & \ldots & \frac{\partial y_1}{\partial z_K}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_K}{\partial z_1} & \ldots & \frac{\partial y_K}{\partial z_K}
\end{bmatrix}
\end{equation}
For illustrative purposes, let’s use an input vector of length 3 (the assumption of a fixed length doesn’t lead to a loss of generality, as we will soon see),
$\boldsymbol{z} = \begin{bmatrix} z_1\\z_2\\z_3\end{bmatrix}$. For this input vector we obtain the following Jacobian for the softmax function $\sigma(\boldsymbol{z})$:
\begin{equation}
\begin{aligned}
\mathbf{J}_{\sigma} &=
\begin{bmatrix}
\frac{\partial y_1}{\partial z_1} & \frac{\partial y_1}{\partial z_2} & \frac{\partial y_1}{\partial z_3}\\
\frac{\partial y_2}{\partial z_1} & \frac{\partial y_2}{\partial z_2} & \frac{\partial y_2}{\partial z_3}\\
\frac{\partial y_3}{\partial z_1} & \frac{\partial y_3}{\partial z_2} & \frac{\partial y_3}{\partial z_3}
\end{bmatrix}\\
&=
\begin{bmatrix}
\frac{\partial }{\partial z_1} \left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) & \frac{\partial }{\partial z_2}\left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) & \frac{\partial }{\partial z_3}\left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right)\\
\frac{\partial }{\partial z_1} \left(\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) & \frac{\partial }{\partial z_2}\left(\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) & \frac{\partial }{\partial z_3}\left(\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}\right)\\
\frac{\partial }{\partial z_1} \left(\frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) & \frac{\partial }{\partial z_2}\left(\frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) & \frac{\partial }{\partial z_3}\left(\frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}\right)
\end{bmatrix}
\end{aligned}
\end{equation}
Let’s look at the single matrix entry $\frac{\partial y_1}{\partial z_1}$. We take the partial derivative of the expression $\left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right)$ with respect to $z_1$, i.e. we treat all other variables as constants. To do so,
we simply apply the quotient rule:
\begin{equation}
\begin{aligned}
&\frac{\partial }{\partial z_1} \left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) =\\
&\frac{\frac{\partial }{\partial z_1} (e^{z_1}) (e^{z_1}+e^{z_2}+e^{z_3}) - \frac{\partial }{\partial z_1} (e^{z_1}+e^{z_2}+e^{z_3}) e^{z_1} }{(e^{z_1}+e^{z_2}+e^{z_3})^2} = \\
&\frac{e^{z_1} (e^{z_1}+e^{z_2}+e^{z_3}) - e^{z_1} e^{z_1} }{(e^{z_1}+e^{z_2}+e^{z_3})(e^{z_1}+e^{z_2}+e^{z_3})} = \\
&\left( \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) \left( \frac{e^{z_1}+e^{z_2}+e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) - \left( \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) \left( \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) =\\
&\left( \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) \left(1 - \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) =\\
&\sigma(z_1) (1-\sigma(z_1))
\end{aligned}
\end{equation}
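Before generalizing, we can check this result numerically for our example vector $\boldsymbol{z} = (1, 2, 3)$; a minimal sketch (the step size is an arbitrary choice):

```python
import numpy as np

# Closed form sigma(z_1)(1 - sigma(z_1)) versus a finite-difference estimate of dy_1/dz_1.
z = np.array([1.0, 2.0, 3.0])
y1 = np.exp(z[0]) / np.exp(z).sum()      # first softmax output, approx. 0.09

h = 1e-6
z_shifted = z + np.array([h, 0.0, 0.0])  # nudge only z_1
y1_shifted = np.exp(z_shifted[0]) / np.exp(z_shifted).sum()

print((y1_shifted - y1) / h)  # numerical estimate, approx. 0.0819
print(y1 * (1 - y1))          # closed form, also approx. 0.0819
```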
So here we computed $\frac{\partial y_1}{\partial z_1}$, the derivative of the first output variable $y_1$ with respect to the first input variable $z_1$. Let’s denote the subscript of
the output variable as $i$ and the subscript of the input variable as $j$, so $\frac{\partial y_i}{\partial z_j}$. We can see
that whenever the subscripts are equal, $i=j$, e.g. for $\frac{\partial y_2}{\partial z_2}$ or $\frac{\partial y_3}{\partial z_3}$,
the partial derivative takes the same form, $\sigma(z_i) (1-\sigma(z_i))$; these are the diagonal entries of the Jacobian. For our input vector of length 3 we can plug
this into our Jacobian:
\begin{equation}
\mathbf{J}_{\sigma} =
\begin{bmatrix}
\sigma(z_1) (1-\sigma(z_1)) & \frac{\partial y_1}{\partial z_2} & \frac{\partial y_1}{\partial z_3}\\
\frac{\partial y_2}{\partial z_1} & \sigma(z_2) (1-\sigma(z_2)) & \frac{\partial y_2}{\partial z_3}\\
\frac{\partial y_3}{\partial z_1} & \frac{\partial y_3}{\partial z_2} & \sigma(z_3) (1-\sigma(z_3))
\end{bmatrix}
\end{equation}
What about the other entries, where $i\neq j$?
Let’s again look at a concrete example, where $i=1$ and $j=2$, $\frac{\partial y_1}{\partial z_2}$:
\begin{equation}
\begin{aligned}
&\frac{\partial }{\partial z_2} \left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) =\\
&\frac{\frac{\partial}{\partial z_2}(e^{z_1}) (e^{z_1}+e^{z_2}+e^{z_3}) - \frac{\partial}{\partial z_2}(e^{z_1}+e^{z_2}+e^{z_3}) e^{z_1} }{(e^{z_1}+e^{z_2}+e^{z_3})^2} = \\
&\frac{0 - e^{z_2} e^{z_1} }{(e^{z_1}+e^{z_2}+e^{z_3})(e^{z_1}+e^{z_2}+e^{z_3})} = \\
&-\left( \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) \left( \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \right) = \\
&-\sigma(z_1)\sigma(z_2)
\end{aligned}
\end{equation}
We can see that, in general, for the softmax function the derivative for $i\neq j$ is $\frac{\partial y_i}{\partial z_j} = -\sigma(z_i)\sigma(z_j)$.
Again we can fill this into our $3 \times 3$ example matrix:
\begin{equation}
\mathbf{J}_{\sigma} =
\begin{bmatrix}
\sigma(z_1) (1-\sigma(z_1)) & -\sigma(z_1)\sigma(z_2) & -\sigma(z_1)\sigma(z_3)\\
-\sigma(z_2)\sigma(z_1) & \sigma(z_2) (1-\sigma(z_2)) & -\sigma(z_2)\sigma(z_3) \\
-\sigma(z_3)\sigma(z_1) & -\sigma(z_3)\sigma(z_2) & \sigma(z_3) (1-\sigma(z_3))
\end{bmatrix}
\end{equation}
Of course, we still need to compute the actual values for each partial derivative; however, that is really easy now, since we just have to plug
the corresponding $z$-values into the derived formulas.
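For our running example $\boldsymbol{z} = (1, 2, 3)$, here is a minimal sketch of this plugging-in step, filling the $3 \times 3$ Jacobian entry by entry with the two formulas derived above:

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])
y = np.exp(z) / np.exp(z).sum()   # softmax output, approx. [0.09, 0.245, 0.665]

K = len(z)
J = np.empty((K, K))
for i in range(K):
    for j in range(K):
        if i == j:
            J[i, j] = y[i] * (1 - y[i])   # diagonal entries
        else:
            J[i, j] = -y[i] * y[j]        # off-diagonal entries
print(J)   # diagonal approx. [0.082, 0.185, 0.223], off-diagonal entries negative
```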
Actually, we can do even better. Instead of writing out a possibly large matrix (for which we would have to allocate memory in our computer program), we can
compute each value on the fly when needed, using the following formula:
\begin{equation}
\frac{\partial y_i}{\partial z_j} =
\begin{cases}
\sigma(z_i)(1-\sigma(z_j)),& \text{if } i= j\\
-\sigma(z_i)\sigma(z_j), & \text{if } i \neq j
\end{cases}
\end{equation}
This can be written even more elegantly using the Kronecker delta function, which is a function of two indices $i$ and $j$ that returns 1 if
$i=j$ and 0 otherwise:
\begin{equation}
\delta_{ij}=
\begin{cases}
1, & \text{if } i= j\\
0, & \text{if } i \neq j
\end{cases}
\end{equation}
With this we can simply write:
\begin{equation}
\frac{\partial y_i}{\partial z_j} = \sigma(z_i)(\delta_{ij}-\sigma(z_j))
\end{equation}
Try it yourself, plug in some numbers and verify that with this formula you can compute each entry of the Jacobian for the softmax function, $\mathbf{J}_\sigma$.
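Here is one way to do that check, as a minimal NumPy sketch: the formula $\sigma(z_i)(\delta_{ij}-\sigma(z_j))$ can be written compactly as $\mathrm{diag}(\boldsymbol{y}) - \boldsymbol{y}\boldsymbol{y}^T$ with $\boldsymbol{y} = \sigma(\boldsymbol{z})$, and we compare it entry-wise against a finite-difference Jacobian at an arbitrarily chosen input:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    """J[i, j] = sigma(z_i) * (delta_ij - sigma(z_j)), i.e. diag(y) - outer(y, y)."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

# Entry-wise comparison with a finite-difference Jacobian at an arbitrary input.
z, h = np.array([0.5, -1.2, 3.0, 0.1]), 1e-6
J_numeric = np.column_stack([(softmax(z + h*e) - softmax(z)) / h
                             for e in np.eye(len(z))])
print(np.allclose(softmax_jacobian(z), J_numeric, atol=1e-5))  # True
```

The expression $\mathrm{diag}(\boldsymbol{y}) - \boldsymbol{y}\boldsymbol{y}^T$ is simply the matrix form of the Kronecker-delta expression above, which is why no explicit case distinction is needed in the code.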
And this is it: this is how to easily compute the derivative of the softmax function, regardless of the length of the input vector $\boldsymbol{z}$.
In conclusion, we’ve delved into the softmax function, a multivariate vector-valued function. Consequently, its derivative takes the form of a Jacobian,
and we’ve established a succinct way of computing this Jacobian using the Kronecker delta function.