3

I was following Prof. Andrew Ng's Coursera course on convolutional neural networks, and I have a question about one of the points he mentions in the YOLO algorithm.

In one of the slides he mentioned two key points:

1) Each grid cell in our $3 \times 3$ grid will have 2 predicted bounding boxes.

2) And each of these bounding boxes will be bigger than the size of the grid cell.

I don't understand why there are $2$ predicted bounding boxes. Is it because we consider two anchor boxes?

Also, how can the bounding boxes be bigger than the size of the grid cell? Because we know that each object can belong to one grid only based on the midpoint.

YOLO ALGORITHM

Green Falcon
Anjith

3 Answers

1

1) Exactly. You have two anchor boxes in Andrew's current example, so the algorithm is going to output two predicted bounding boxes for each grid cell.
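A rough sketch of why two anchor boxes give two predictions per cell: each anchor contributes its own slot in the output for every grid cell. The function name and the choice of 3 classes are assumptions for illustration; the per-box layout (one objectness score, four box values, then class scores) is the commonly used one.

```python
def yolo_output_shape(grid_size, num_anchors, num_classes):
    """Shape of the prediction tensor for a square grid.

    Each anchor slot holds: objectness (1) + box values (4) + class scores.
    """
    values_per_box = 5 + num_classes
    return (grid_size, grid_size, num_anchors * values_per_box)


# 3 x 3 grid, 2 anchor boxes, 3 classes (the class count is an assumption)
shape = yolo_output_shape(grid_size=3, num_anchors=2, num_classes=3)
boxes_per_cell = 2
total_boxes = shape[0] * shape[1] * boxes_per_cell
print(shape, total_boxes)
```

So every one of the 9 cells always emits 2 boxes, whether or not an object is actually there; the objectness score tells you which ones to keep.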

2) Your statement below is not true:

"Because we know that each object can belong to one grid only based on the midpoint"

I don't remember that being said on the course. The center of the object belongs to a single cell.

TO CLARIFY: The fact that an object spans a region larger than the grid cell it is assigned to has nothing to do with the grid cell's size itself. The object can and will be larger than its assigned grid cell. However, the output assigns each object to one grid cell because that grid cell contains its midpoint.

In any case, the receptive field of the neurons is much larger than the single cell they are responsible for (i.e. it can cover the whole image). Anchors are initialized with a certain width and height but are rescaled during inference, based on the object size identified from the final feature map. So one can consider that YOLO predicts both the occurrence of an object and its size.
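To make the midpoint rule concrete, here is a minimal sketch, assuming boxes are given as normalized image coordinates `(cx, cy, w, h)` in the range 0 to 1. The helper name `assign_cell` and the example numbers are hypothetical.

```python
def assign_cell(cx, cy, w, h, grid_size):
    """Pick the grid cell that contains the box midpoint.

    Returns (row, col) of the responsible cell and the box width and
    height measured in units of one cell.
    """
    # Clamp so that a midpoint exactly on the right/bottom edge stays in range.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    w_in_cells = w * grid_size
    h_in_cells = h * grid_size
    return row, col, w_in_cells, h_in_cells


# An object centred in the image that covers half of its width.
row, col, w_cells, h_cells = assign_cell(cx=0.5, cy=0.5, w=0.5, h=0.25, grid_size=3)
print(row, col, w_cells, h_cells)
```

The object is assigned only to the centre cell, yet its width is 1.5 cells: the assignment depends on the midpoint, not on the extent of the box.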

More on the receptive field of CNNs

Check the: "A regressor rather than a classifier" part

A taste: For every positive position, the network predicts a regression on the bounding box precise position and dimension

This has also been asked here

Nikos H.
1

I think the other answer is sufficient for the question. I just want to add that the algorithm uses different anchor boxes because the centres of distinct objects may fall on the same pixel, and hence in the same grid cell. The real algorithm uses more than two anchor boxes. For instance, you can clearly see this in the image he uses in his slide: the centres of the two objects are on the same pixel. You should also consider that the anchor boxes for each class differ and are unique to each.
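One common way to decide which anchor an object goes to, when two objects share a cell, is to compare shapes only: take the anchor whose width and height give the highest IoU with the object's box when both are centred on the same point. The sketch below assumes that rule; the anchor sizes and object sizes are made-up values, not ones from the slides.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same centre (only shapes differ)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape best matches the box."""
    scores = [shape_iou(box_wh, anchor) for anchor in anchors]
    return scores.index(max(scores))


# Hypothetical anchors as (width, height): one tall, one wide.
anchors = [(0.2, 0.6), (0.6, 0.2)]

pedestrian = (0.15, 0.5)  # tall, narrow object
car = (0.5, 0.2)          # wide, flat object

pedestrian_anchor = best_anchor(pedestrian, anchors)
car_anchor = best_anchor(car, anchors)
print(pedestrian_anchor, car_anchor)
```

With this rule, the two objects whose midpoints coincide end up in different anchor slots of the same cell, so both can be predicted.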

Green Falcon
0

A few points to recall:

1. The bounding box is a label.
2. Grids are useful for predicting midpoints.

Though he mentions that grids are useful for predicting, the main goal is to predict the object itself. The grid indicates whether the object is present or not (in other words, it locates $b_x, b_y$). The ground truth for the bounding box is w.r.t. the entire image. So the predictions for the bounding box size ($b_h, b_w$) are w.r.t. the entire image, which means that the bounding boxes can lie within, on, or outside the grid cell.
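A minimal sketch of this decoding, following the convention described in this answer: `bx, by` are offsets of the midpoint inside the responsible cell, and `bw, bh` are fractions of the whole image. Other formulations express the size relative to the cell instead, so treat the convention and the function names here as assumptions.

```python
def decode_box(row, col, bx, by, bw, bh, grid_size):
    """Turn a per-cell prediction into image-normalized corner coordinates."""
    cx = (col + bx) / grid_size  # midpoint in image coordinates
    cy = (row + by) / grid_size
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


def extends_beyond_cell(row, col, box, grid_size):
    """True if any edge of the box lies outside the responsible cell."""
    x_min, y_min, x_max, y_max = box
    cell_left, cell_right = col / grid_size, (col + 1) / grid_size
    cell_top, cell_bottom = row / grid_size, (row + 1) / grid_size
    return (x_min < cell_left or x_max > cell_right
            or y_min < cell_top or y_max > cell_bottom)


# Centre cell of a 3 x 3 grid, midpoint in the middle of that cell.
box = decode_box(row=1, col=1, bx=0.5, by=0.5, bw=0.5, bh=0.25, grid_size=3)
spills_over = extends_beyond_cell(1, 1, box, grid_size=3)
print(box, spills_over)
```

The midpoint stays inside the cell by construction, while nothing constrains the corners, which is why the box can reach into neighbouring cells.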

J_code