111

This is a small conceptual question that's been nagging me for a while: How can we back-propagate through a max-pooling layer in a neural network?

I came across max-pooling layers while going through this tutorial for Torch 7's nn library. The library abstracts the gradient calculation and forward passes for each layer of a deep network. I don't understand how the gradient calculation is done for a max-pooling layer.

I know that if you have an input ${z_i}^l$ going into neuron $i$ of layer $l$, then ${\delta_i}^l$ (defined as ${\delta_i}^l = \frac{\partial E}{\partial {z_i}^l}$) is given by: $$ {\delta_i}^l = \theta^{'}({z_i}^l) \sum_{j} {\delta_j}^{l+1} w_{i,j}^{l,l+1} $$

So, a max-pooling layer would receive the ${\delta_j}^{l+1}$'s of the next layer as usual; but since the activation function for the max-pooling neurons takes in a vector of values (over which it maxes) as input, ${\delta_i}^{l}$ isn't a single number anymore, but a vector ($\theta^{'}({z_j}^l)$ would have to be replaced by $\nabla \theta(\left\{{z_j}^l\right\})$). Furthermore, $\theta$, being the max function, isn't differentiable with respect to its inputs.

So... how should it work out, exactly?

shinvu

6 Answers

111

There is no gradient with respect to non-maximum values, since changing them slightly does not affect the output. Further, the max is locally linear with slope 1, with respect to the input that actually achieves the max. Thus, the gradient from the next layer is passed back only to the neuron which achieved the max. All other neurons get zero gradient.

So in your example, $\delta_i^l$ would be a vector of all zeros, except that the $i^{*}$-th location gets the value $\delta_j^{l+1}$, where $i^* = \operatorname{argmax}_{i} (z_i^l)$.
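To see this concretely, here is a minimal sketch using modern PyTorch's autograd (not the Torch 7 nn library from the question, purely as an illustration): the gradient of a max is 1 at the argmax and 0 everywhere else, so the upstream gradient is routed to that single input.

import torch

# A toy "pooling window": take the max over a vector of inputs.
z = torch.tensor([1.0, 5.0, 3.0, 2.0], requires_grad=True)
out = z.max()      # forward pass
out.backward()     # backward pass
print(z.grad)      # tensor([0., 1., 0., 0.]) -- only the argmax receives the gradient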

Archana David
abora
7

Max Pooling

So suppose you have a layer P which comes on top of a layer PR. Then the forward pass will be something like this:

$ P_i = f(\sum_j W_{ij} PR_j)$,

where $P_i$ is the activation of the $i$-th neuron of the layer $P$, $f$ is the activation function and $W$ are the weights. So if you differentiate that, by the chain rule you get that the gradients flow as follows:

$grad(PR_j) = \sum_i grad(P_i) f^\prime W_{ij}$.

But now, if you have max pooling, $f = id$ for the max neuron and $f = 0$ for all other neurons, so $f^\prime = 1$ for the max neuron in the previous layer and $f^\prime = 0$ for all other neurons. So:

$grad(PR_{max\ neuron}) = \sum_i grad(P_i) W_{i\ {max\ neuron}}$,

$grad(PR_{others}) = 0.$
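As a rough numerical sketch of these two equations (the array values and the index of the max neuron are made up purely for illustration), assuming we already know which $PR$ neuron achieved the max:

import numpy as np

grad_P = np.array([0.5, -1.0])          # grad(P_i), coming from the layer above
W = np.array([[0.2, 0.7, 0.1],          # W[i, j] connects PR_j to P_i
              [0.4, 0.3, 0.9]])
max_neuron = 1                          # index of the PR neuron that achieved the max

f_prime = np.zeros(W.shape[1])          # f' = 1 for the max neuron, 0 for all others
f_prime[max_neuron] = 1.0

grad_PR = (grad_P @ W) * f_prime        # grad(PR_j) = sum_i grad(P_i) f' W_ij
print(grad_PR)                          # non-zero only at index max_neuron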

patapouf_ai
6

@Shinvu's answer is well written; I would just like to point to a video that explains the gradient of the max() operation within a computational graph, which is quick to grasp.

While implementing the max-pool operation (a computational node in a computational graph, i.e. your NN architecture), we need a function that creates a "mask" matrix which keeps track of where the maximum of the matrix is. True (1) indicates the position of the maximum in X; the other entries are False (0). We keep track of the position of the max because this is the input value that ultimately influenced the output, and therefore the cost. Backprop computes gradients with respect to the cost, so anything that influences the ultimate cost should have a non-zero gradient. So backprop will "propagate" the gradient back to this particular input value that influenced the cost.
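A minimal NumPy sketch of such a mask function (the name create_mask is mine, not from any particular library) could look like this:

import numpy as np

def create_mask(X):
    """Return a mask that is True (1) only where X attains its maximum."""
    return X == np.max(X)

X = np.array([[1.0, 3.0],
              [2.0, 0.5]])
mask = create_mask(X)
print(mask)           # True only at position (0, 1)
# During backprop, the upstream gradient of this window is routed through the mask:
print(mask * 7.0)     # e.g. an upstream gradient of 7 lands only on the max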

Anu
1

Maybe it is easier to understand the derivative of the pooling layer after we write down the matrix form of the function $\max\{x\} = xW$, where $x$ is a vector. Consider an example with four different values, $x = [x_1, x_2, x_3, x_4]$, where $x_1$ is the largest, so that $\max\{x\} = x_1$. Now $\max\{x\}$ can be written as a matrix multiplication $xW$ (or $Wx'$), where $W = [I(x_1>x_2)I(x_1>x_3)I(x_1>x_4),\ I(x_2>x_1)I(x_2>x_3)I(x_2>x_4),\ I(x_3>x_1)I(x_3>x_2)I(x_3>x_4),\ I(x_4>x_1)I(x_4>x_2)I(x_4>x_3)]' = [1, 0, 0, 0]'$ and $I(\cdot)$ is the indicator function comparing two values. Normally, two derivatives are needed in the back-propagation algorithm: $dWx/dx = W'$ ($W$ transposed) to update the previous layer, and $dWx/dW = x'$ to update the current layer's weights or biases. In the case of max pooling, however, there is no need to update $W$, because we have already written down its closed form, $W = [I(x_1>x_2)I(x_1>x_3)I(x_1>x_4),\ I(x_2>x_1)I(x_2>x_3)I(x_2>x_4),\ \ldots]'$. To find $dWx/dx$ we simply have $dWx/dx = W' = [1, 0, 0, 0]$, and $W'$ can then be inserted as one factor in the derivative chain.

In general, $\max\{x\}$ can be written as a linear function $Wx$ (or $xW$), where $W$ is a matrix whose entries are products of indicator functions such as $I(x_1>x_2)I(x_1>x_3)I(x_1>x_4)$. One property of $W$ is that in each column exactly one entry is 1 and all the others are 0. Since this linear function's weights are fully determined by the comparisons, there is no need to use gradient descent to update them. When using $\max\{x\}$ to update the previous layers' weights and biases via the chain rule, all we need is $dWx/dx = W'$.
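Here is a small NumPy sketch of this view for the four-value example (the variable names are mine): $W$ is built from the indicator comparisons, $xW$ reproduces the max, and $W'$ is the gradient routed back to $x$.

import numpy as np

x = np.array([2.0, 7.0, 1.0, 4.0])      # x_2 is the (unique) largest value
n = len(x)

# Entry j of W is the product of indicators I(x_j > x_k) over all k != j.
W = np.array([[float(all(x[j] > x[k] for k in range(n) if k != j))]
              for j in range(n)])        # here W = [0, 1, 0, 0]'

print(x @ W)     # [7.], equals max(x)
print(W.T)       # dWx/dx = W' = [[0, 1, 0, 0]], the gradient with respect to x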

Kun Qiu
0

In my case, although it is intuitive, I was unable to see why the derivative of the max-pooling function equals one for the pooled value. Searching for "derivative of the max pooling function" did not help, but understanding the two-variable case did.

For completeness, I will discuss the max-pooling derivative. First, we consider max pooling as a multivariable function $f$ of the filter-map values, $f(x_1, \cdots, x_n) = \max(x_1, \cdots, x_n)$. Also, we will assume that all values are different (no ties for the maximum; see here why I make this simplification). Now, it is better to write the function in piecewise (bracket) notation:

\begin{equation}\label{eq:max_pooling_backprop} f(x_1, \cdots, x_n) = \max(x_1, \cdots, x_n) = \begin{cases} x_1 & \text{if } x_1 > x_2, x_1 > x_3, \cdots, x_1 > x_n\\ x_2 & \text{if } x_2 > x_1, x_2 > x_3, \cdots, x_2 > x_n\\ \vdots & \qquad\vdots\\ x_n & \text{if } x_n > x_1, x_n > x_2, \cdots, x_n > x_{n - 1} \end{cases} \end{equation}

To see why (citing abora above)

the max is locally linear with slope 1, with respect to the input that actually achieves the max,

one can find the partial derivatives of $f$ with respect to $x_1$ (without loss of generality), say:

\begin{equation} \frac{\partial f}{\partial x_1} = \begin{cases} 1 & \text{if } x_1 > x_2, x_1 > x_3, \cdots, x_1 > x_n\\ 0 & \text{otherwise}, \end{cases} \end{equation}

where it is now clear that the partial derivative (gradient) $\frac{\partial f}{\partial x_1}$ equals $1$ when the max-pooled value was $x_1$. In that case $f(x_1, \cdots, x_n) = x_1$, and therefore $\frac{\partial f}{\partial x_2} = \frac{\partial f}{\partial x_3} = \cdots = \frac{\partial f}{\partial x_n} = 0$.
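As a quick sanity check, one can approximate these partial derivatives with central finite differences (a small NumPy sketch under the no-ties assumption above; the variable names are mine):

import numpy as np

x = np.array([0.3, 2.0, -1.5, 0.7])     # x_2 is the unique maximum
eps = 1e-6

grads = []
for i in range(len(x)):
    e = np.zeros_like(x)
    e[i] = eps
    # Central-difference approximation of the partial derivative of max w.r.t. x_i
    grads.append((np.max(x + e) - np.max(x - e)) / (2 * eps))

print(np.round(grads, 3))    # approximately [0, 1, 0, 0]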

neoglez
0
import numpy as np

def max_pool_backward(d_output, input, pool_size):
    """
    Perform back-propagation through a max-pooling layer.

    Parameters:
    - d_output: Gradient of the loss with respect to the output of the max-pooling layer (same shape as the pooled output).
    - input: Input tensor to the max-pooling layer.
    - pool_size: Size of the pooling window (e.g., 2 for 2x2 pooling).

    Returns:
    - d_input: Gradient of the loss with respect to the input of the max-pooling layer (same shape as the input).
    """
    d_input = np.zeros_like(input)
    input_height, input_width = input.shape
    pool_height, pool_width = pool_size, pool_size
    output_height, output_width = d_output.shape

    for i in range(output_height):
        for j in range(output_width):
            # Identify the region in the input corresponding to the current output value
            region = input[i*pool_height:(i+1)*pool_height, j*pool_width:(j+1)*pool_width]
            # Find the index of the max value in the region
            max_index = np.unravel_index(np.argmax(region), region.shape)
            # Route the gradient from d_output to the position of the max value in d_input
            d_input[i*pool_height + max_index[0], j*pool_width + max_index[1]] = d_output[i, j]

    return d_input

# Example: 3x3 input with a 2x2 pooling window (the windows at the right and
# bottom edges are partial for this input size).
tensor = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
pool_size = 2
d_output = np.array([[1, 1], [1, 1]])   # upstream gradient, one value per pooling window
d_input = max_pool_backward(d_output, tensor, pool_size)
print(d_input.shape)
print(d_input)
jeremy