Derivatives on hidden layers in backpropagation (ANNs)

Question

I'm working on understanding all the math used in artificial neural networks. I have gotten stuck at calculating the error function derivatives for hidden layers when performing backpropagation.

Formula 5.55 on page 244 of Bishop's "Pattern Recognition and Machine Learning" gives the derivative of the error function for a hidden unit as a sum of derivatives over all units to which it sends connections:

$$ \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$

I know the chain rule. If $a_j$ goes into only one other node, we can apply the chain rule to separate the parts. But what is the intuition behind summing these values for all nodes if the output goes into multiple nodes?

Thanks

It makes no sense with total derivatives. In the book it's written with partial derivatives. The $\TeX$ command for the partial derivative symbol is \partial. There's an edit link underneath the question. It might also be a good idea to link to the book in the question so people can look up the context. — joriki, Feb 20 '13 at 18:22
The formula can also be seen here: http://web.cs.swarthmore.edu/~meeden/cs81/s10/BackPropDeriv.pdf However, I haven't found any rule or explanation why this summing works and is correct. — Marek, Feb 20 '13 at 22:43
This is backpropagation, so the error information is flowing backward through the network. The index $k$ iterates over all the nodes to which node $j$ supplies information in the forward sense; intuitively, error in node $a_k$ for each index $k$ is due, in some part, to the action of node $j$. — Emily, Feb 20 '13 at 22:44
Yes, I understand that. But why does this sum work? How does summing over the right side equal the left side? Is there a rule for that? I tried to derive this on my own and hit a dead end when the signal is sent into multiple nodes. — Marek, Feb 20 '13 at 22:52
@Marek: Are you aware of the chain rule for multivariate functions? — joriki, Feb 20 '13 at 23:12

score 2 · Answer 1 · answered Dec 04 '19 at 00:59

Thinking in terms of directional derivatives might be more instructive in this case, as we can arrive at the chain rule formulation in a constructive fashion.

Let us consider the directional derivative of a multivariate function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ in an arbitrary direction $\textbf{u} \in \mathbb{R}^n$. Since $\textbf{u}$ is a direction, we shall assume $\|\textbf{u}\| = 1$. For differentiable $f$, the directional derivative in the direction of $\textbf{u}$ is given by

$$\nabla_u f = \nabla f \cdot \textbf{u}$$

This can be proved by noting that, under the usual limit definition of a derivative, we have

$$\nabla_u f (\textbf{x}) = \lim_{h\rightarrow0} \frac{f(\textbf{x} + h\textbf{u}) - f(\textbf{x})}{h }$$

Since we assume that the objective function $f$ is differentiable, around any point $\textbf{a}$ it admits a linear approximation whose error vanishes faster than $\|\textbf{x} - \textbf{a}\|$ as $\textbf{x} \rightarrow \textbf{a}$:

$$f(\textbf{x}) \approx f(\textbf{a}) + \nabla f(\textbf{a}) \cdot (\textbf{x} - \textbf{a})$$

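Plugging this approximant into our previous limit formulation with $\textbf{x} = \textbf{a} + h\textbf{u}$, the numerator becomes $h \, \nabla f(\textbf{a}) \cdot \textbf{u}$ plus a remainder that vanishes faster than $h$, so we get, for any $\textbf{a}$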

$$\nabla_u f (\textbf{a}) = \nabla f(\textbf{a}) \cdot \textbf{u}$$
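As a quick numerical sanity check of this identity, here is a minimal Python sketch. The test function `f`, the point `a` and the unit vector `u` are hypothetical choices made only for illustration, and a central difference stands in for the one-sided limit.

```python
import math

def f(x):
    # Hypothetical smooth test function f(x, y) = x^2 * y + sin(y)
    return x[0] ** 2 * x[1] + math.sin(x[1])

def grad_f(x):
    # Analytic gradient of the test function above
    return [2 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]

def directional_derivative_fd(func, a, u, h=1e-6):
    # Central-difference approximation of the limit definition
    plus = [ai + h * ui for ai, ui in zip(a, u)]
    minus = [ai - h * ui for ai, ui in zip(a, u)]
    return (func(plus) - func(minus)) / (2 * h)

a = [1.5, -0.7]        # hypothetical evaluation point
u = [3 / 5, 4 / 5]     # unit vector: (3/5)^2 + (4/5)^2 = 1

limit_value = directional_derivative_fd(f, a, u)
dot_value = sum(g * ui for g, ui in zip(grad_f(a), u))
print(limit_value, dot_value)
```

The two printed numbers should agree up to the finite-difference error.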

Going back to the original problem: when differentiating the cost function $E_n$ with respect to an arbitrary variable $a_j$, we also need to take into account the perturbations that a change in $a_j$ induces in the other variables it interacts with. Specifically, when altering $a_j$ we also move along the direction of the perturbed variables.

This direction essentially encapsulates all the changes caused in the intermediate variables $\{a_k\}_{k=1}^{n}$ when a certain $a_j$ is altered. For every $a_k$ that is directly influenced by $a_j$, the rate of change is $\frac{\partial a_k}{\partial a_j}$.

Let $\textbf{p}$ denote this direction vector, with components

$$p_k = \frac{\partial a_k}{\partial a_j}$$

where $k$ runs through all variables $a_k$ that are directly influenced by $a_j$. (The vector $\textbf{p}$ need not have unit norm; the limit above evaluates to $\nabla f \cdot \textbf{p}$ for any vector.)

Coupling the directional derivative with the previously described concept, we arrive at the desired result, namely

$$\frac{\partial E_n}{\partial a_j} = \nabla E_n \cdot \textbf{p} = \sum_{k \in \mathcal{S}} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$

where $\mathcal{S}$ is the set of all indices corresponding to variables directly influenced by $a_j$.
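To see the sum at work on concrete numbers, here is a small Python sketch under stated assumptions: the two intermediate variables in `downstream` and the function `error` are made up for illustration and are not taken from Bishop. It builds the vector of $\frac{\partial a_k}{\partial a_j}$, takes its dot product with the gradient of the error, and compares the result with a finite-difference derivative with respect to $a_j$.

```python
import math

def downstream(a_j):
    # Hypothetical intermediate variables directly influenced by a_j
    a_1 = 3.0 * a_j
    a_2 = math.sin(a_j)
    return a_1, a_2

def error(a_1, a_2):
    # Hypothetical error that depends on a_j only through a_1 and a_2
    return 0.5 * (a_1 - 1.0) ** 2 + a_1 * a_2

a_j = 0.4
a_1, a_2 = downstream(a_j)

# Gradient of the error with respect to (a_1, a_2)
grad_E = [(a_1 - 1.0) + a_2, a_1]
# p_k = d a_k / d a_j for each downstream variable
p = [3.0, math.cos(a_j)]

chain_rule_value = sum(g * pk for g, pk in zip(grad_E, p))

# Finite-difference derivative of the composed function a_j -> E
h = 1e-6
finite_diff_value = (error(*downstream(a_j + h)) - error(*downstream(a_j - h))) / (2 * h)
print(chain_rule_value, finite_diff_value)
```

Dropping either term of the sum makes the two numbers disagree, which is the practical reason every variable influenced by $a_j$ has to be included.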

score 0 · Answer 2 · answered Jun 28 '19 at 04:03

Let $E_n$ be the error on a single datum, dependent on the activations $a_j$, i.e.,

$$ E_n = \frac{1}{2}\sum_k (y_{nk}(x_n) - t_{nk})^2, $$

where $y$ depends on the activations of the last layer through a variable $z$:

$$ z_j = h(a_j)=h\left( \sum_{i\in I_j} w_{ji} z_i \right). $$

Here $I_j$ is the set of indices of "nodes" in the previous layer that affect the output of the current node $j$, i.e., each such node's output $z_i$ is fed into the above equation to compute $z_j$. Conversely, $U_j$ will denote the set of indices of nodes to which node $j$ sends connections, i.e., the nodes $k$ with $j \in I_k$. For the output layer we may have something like $y_{nk}=z_k(x_n)$.

I think this notation is very confusing, because $z_\alpha$ is being used to denote the activations of any node in the network, regardless of which layer they are in; that is, the notation does not make the layer number explicit. So, iterating over $z_\alpha$ will iterate over every node in the network. The key is that the summation is only over the input nodes for a given activation; i.e., the index set is what differentiates the nodes.
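One way to make the role of the index sets concrete is to store, for each node, only the list of nodes feeding it, with no explicit layer number. The Python sketch below does this for a hypothetical five-node network (the topology, the weights and the names `inputs_of` and `sends_to` are all assumptions made for illustration). Inverting the input lists yields, for each node, the units it sends connections to.

```python
import math

# Hypothetical network: nodes 0 and 1 are inputs, 2 and 3 are hidden, 4 is the output.
inputs_of = {2: [0, 1], 3: [0, 1], 4: [2, 3]}
# weights[(j, i)] is the weight w_ji on the connection from node i to node j
weights = {(2, 0): 0.5, (2, 1): -0.4,
           (3, 0): 0.3, (3, 1): 0.9,
           (4, 2): 1.2, (4, 3): -0.7}

def forward(x0, x1):
    # z holds node outputs, a holds the weighted sums of each node's inputs
    z = {0: x0, 1: x1}
    a = {}
    for j in sorted(inputs_of):  # node ids happen to be in topological order
        a[j] = sum(weights[(j, i)] * z[i] for i in inputs_of[j])
        z[j] = math.tanh(a[j])
    return a, z

# Invert the input lists: sends_to[i] lists the units that node i feeds
sends_to = {}
for j, ins in inputs_of.items():
    for i in ins:
        sends_to.setdefault(i, []).append(j)

a, z = forward(1.0, 2.0)
print(sends_to)
```

The forward pass sums over `inputs_of[j]`, whereas the backward sum in formula 5.55 runs over `sends_to[j]`.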

Anyway, to quote Bishop:

We again make use of the chain rule for partial derivatives, $$ \frac{\partial E_n}{\partial a_j} = \sum_{k\in U_j} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} $$ where the sum runs over all units k to which unit j sends connections.

I've italicized the relevant parts and added the $U_j$ index set for clarity.

Recall the multivariate chain rule for $\widehat{f}(q) = f(r(q))=f(r_1(q),\ldots,r_n(q))$, where $f :\mathbb{R}^n\rightarrow\mathbb{R}$ and $r_s:\mathbb{R}^m\rightarrow\mathbb{R}$. The gradient is

$$ \nabla_q \widehat{f}(q)=\frac{\partial}{\partial q} f(r(q)) = \left( \frac{\partial}{\partial q_1} f(r(q)),\ldots,\frac{\partial}{\partial q_m} f(r(q)) \right), $$

and each component is

$$ \frac{\partial}{\partial q_l} f(r(q)) = \sum_{s=1}^{n} \frac{\partial f}{\partial r_s}\frac{\partial r_s}{\partial q_l}, \qquad l = 1,\ldots,m. $$

Let's apply it to our case. We have $ f = E_n $, $ r_i = a_i \;\forall\; i\in U_j $, and $ q = a_j $ (so $m = 1$ and $n = |U_j|$), so $f(r(q)) \equiv E_n(A_j(a_j))$, where $ A_j(a_j) = (r_1(a_j),\ldots,r_{|U_j|}(a_j)) $.

Keep in mind that in the forward pass $a_j$ is passed to (i.e., affects) each node in $U_j$ on the way to computing $E_n$, while in the backward pass, when computing $\partial E_n/\partial a_j$, you need to sum the contributions from every node in $U_j$.

Plugging these values into the chain rule formula we get $$ \frac{\partial}{\partial a_j} E_n(A_j(a_j)) =\frac{\partial}{\partial a_j} E_n(r_1(a_j),\ldots,r_{|U_j|}(a_j)) = \sum_{s\in U_j} \frac{\partial f}{\partial r_s}\frac{\partial r_s}{\partial a_j} = \sum_{k\in U_j} \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j} $$
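For a feed-forward network with $a_k = \sum_j w_{kj} h(a_j)$, each factor is $\frac{\partial a_k}{\partial a_j} = w_{kj} h'(a_j)$, so the sum becomes $h'(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}$. The Python sketch below checks this on a tiny example under stated assumptions: one hidden unit $j$ with $h = \tanh$ feeding two linear output units, with made-up weights, offsets and targets.

```python
import math

# All numbers below are hypothetical, chosen only for illustration.
w = [0.8, -1.3]   # weights w_kj from hidden unit j to units k = 1, 2
c = [0.1, 0.4]    # fixed contributions of other units to each a_k
t = [0.5, -0.2]   # targets

def error(a_j):
    """E_n as a function of a_j alone, with linear output units y_k = a_k."""
    z_j = math.tanh(a_j)
    a_k = [wk * z_j + ck for wk, ck in zip(w, c)]
    return 0.5 * sum((ak - tk) ** 2 for ak, tk in zip(a_k, t))

a_j = 0.3
z_j = math.tanh(a_j)
a_k = [wk * z_j + ck for wk, ck in zip(w, c)]

delta_k = [ak - tk for ak, tk in zip(a_k, t)]      # dE_n/da_k for linear outputs
da_k_da_j = [wk * (1.0 - z_j ** 2) for wk in w]    # w_kj * h'(a_j) with h = tanh
delta_j = sum(dk * d for dk, d in zip(delta_k, da_k_da_j))

# Independent check: finite-difference derivative of E_n with respect to a_j
h = 1e-6
finite_diff = (error(a_j + h) - error(a_j - h)) / (2 * h)
print(delta_j, finite_diff)
```

The backpropagated value `delta_j` and the finite-difference estimate should match to several decimal places.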
