29

I was reading an article about convolutional neural networks, and I found something that I don't understand, which is:

The filter must have the same number of channels as the input image so that the element-wise multiplication can take place.

Now, what I don't understand is: what is a channel in a convolutional neural network? I have tried looking for the answer, but I still can't understand what it is.

Can someone explain it to me?

Thanks in advance.

KoKlA

4 Answers

30

Let's assume that we are talking about 2D convolutions applied on images.

In a grayscale image, the data is a matrix of dimensions $w \times h$, where $w$ is the width of the image and $h$ is its height. In a color image, we normally have 3 channels: red, green and blue; this way, a color image can be represented as a matrix of dimensions $w \times h \times c$, where $c$ is the number of channels, that is, 3.

A convolution layer receives the image ($w \times h \times c$) as input, and generates as output an activation map of dimensions $w' \times h' \times c'$. The number of input channels in the convolution is $c$, while the number of output channels is $c'$. The filter for such a convolution is a tensor of dimensions $f \times f \times c \times c'$, where $f$ is the filter size (normally 3 or 5).

In other words, the number of channels is the depth of the tensors involved in the convolution, and a convolution layer controls how that depth changes by specifying its input and output channels.
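
For concreteness, here is a minimal sketch of these shapes (assuming PyTorch, which is just my choice for illustration; note that PyTorch orders tensors as batch × channels × height × width and stores the filter as $c' \times c \times f \times f$ rather than $f \times f \times c \times c'$):

```python
import torch
import torch.nn as nn

w, h, c, c_prime, f = 32, 32, 3, 16, 5        # hypothetical sizes for illustration
conv = nn.Conv2d(in_channels=c, out_channels=c_prime, kernel_size=f)

x = torch.randn(1, c, h, w)                   # one color image, channels-first layout
y = conv(x)

print(conv.weight.shape)   # torch.Size([16, 3, 5, 5])  -> c' x c x f x f
print(y.shape)             # torch.Size([1, 16, 28, 28]) -> w' = w - f + 1 without padding
```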

These explanations carry over directly to 1D or 3D signals, but the analogy with image channels makes 2D signals the most natural example.

noe
8

This is a frequently asked question because of its confusing nature. So let me try to shed a little light on this.

The term channel comes from "media". Looking at the broadcast technology behind TVs, you have multiple channels carrying different information that gets broadcast to your TV. For example, an image might consist of just three channels that describe how much red, green, or blue each pixel contains. Mapping this to a CNN, you would have an RGB image with three channels. An image can be interpreted in other ways as well: for example, you could instead record how cyan, magenta, yellow, or black each pixel is. That would mean your CMYK image is analyzed through four channels (each colour being one channel).

In CNNs this means that each of your filters is applied to each of your channels. Why? Because your filters may pick up different information from each of the channels, and the weights for each channel may converge to different values after each learning step as well.
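
As a rough illustration (my own NumPy sketch, not part of the original answer), a single filter spans all three channels of an RGB patch: each channel is multiplied element-wise by its slice of the filter, and everything is summed into one output value:

```python
import numpy as np

rng = np.random.default_rng(0)
rgb_patch = rng.random((3, 3, 3))   # a 3x3 patch of an RGB image: height x width x channels
filt = rng.random((3, 3, 3))        # the filter has the same number of channels as the input

out_value = np.sum(rgb_patch * filt)  # element-wise multiply per channel, then sum everything
print(out_value)                      # one number per spatial position of the output map
```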

Philipp
7

The term channel comes from communication science; it is not a term specific to data science or artificial intelligence.

In general, a channel transmits information using signals (a channel has a certain capacity for transmitting information).

For an image, these are usually colors (RGB codes) arranged by pixel, which transmit the actual information to the receiver. In the simplest case, (digital) colors are created from three pieces of information, the so-called channels: a mix of red, green, and blue. However, images can also involve opacity (RGBA, where "a" stands for alpha, the channel for opacity) or 3D layering (a beta channel). The number of channels can vary across images.

However, most images use three channels (RGB).

So when the author of the article writes:

The filter must have the same number of channels as the input image so that the element-wise multiplication can take place.

they mean that the mathematical operation (the kernel filter) cannot be applied if, for example, you pass a 3-channel filter to your CNN while your images use a higher number of channels; the filter's channel count must match the image's.
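
To make that channel-matching requirement concrete, here is a small sketch (PyTorch assumed, the sizes are made up): a convolution built for 3 input channels works on an RGB image but fails on a 4-channel RGBA image:

```python
import torch
import torch.nn as nn

conv_rgb = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

rgb_image = torch.randn(1, 3, 64, 64)    # channel counts match -> works
print(conv_rgb(rgb_image).shape)         # torch.Size([1, 8, 62, 62])

rgba_image = torch.randn(1, 4, 64, 64)   # 4 channels, but the filter only has 3
try:
    conv_rgb(rgba_image)
except RuntimeError as err:
    print(err)                           # complains that the input channel count does not match
```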

Maeaex1
5

To add to the previous answers: in CNNs, and in neural networks in general, "channels" can be seen as "features".

With CNNs, you typically start with an object of size $W \times H \times 3$ (for an RGB image) or $W \times H \times 1$ (for a grayscale image). This last dimension can be thought of as the feature dimension. A typical CNN (e.g. ResNet) then reduces the first two (spatial) dimensions and increases the third dimension (channels) by applying a sequence of alternating convolution and pooling operations. Effectively, you trade away spatial dimensions (which are hard to use directly for classification/regression) in order to vectorize your image (creating channels = features).

Intermediate representations whose spatial dimensions are still larger than 1 are typically called feature maps, and the final feature map, whose spatial dimensions are $1 \times 1$, is sometimes called the embedding of the image (though the word embedding can be used for any feature map reshaped into a vector). Finally, this vector representation is used to make image-wise predictions (regression or classification) by applying one (or more) dense layers.
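
As a rough sketch of this trade-off (PyTorch assumed, the layer sizes are invented for illustration): the spatial dimensions shrink while the number of channels grows, until a $1 \times 1$ feature map is flattened and passed to a dense layer:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32x3  -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16x16 -> 8x8x32
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                                                  # 8x8x64   -> 1x1x64 "embedding"
    nn.Flatten(),
    nn.Linear(64, 10),                                                        # image-wise prediction
)

x = torch.randn(1, 3, 32, 32)
print(net(x).shape)   # torch.Size([1, 10])
```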

Anvar Kurmukov