5

The YOLO model splits the image into a grid of smaller cells, and each grid cell is responsible for predicting 5 bounding boxes.

My question is: how does the model make these bounding boxes for every grid cell? Does each box have a predefined offset with respect to, say, the center of the grid cell?

To be clear, I am NOT talking about the final bounding box that encloses the object. I am talking about the 5 predicted bounding boxes that are present for each grid cell.

For example, if the center of a grid cell is located at, say, 50x50, should its bounding boxes then be at (50+5)x(50+5) or something like that?

If not, then how do the bounding boxes come to be?

Paper - https://arxiv.org/pdf/1506.02640.pdf

2 Answers

2

Andrew Ng's explanation actually covers YOLOv2, which uses anchor boxes. YOLOv1, which is the paper you linked, does not use anchor boxes, so it's not exactly the same.
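For contrast, here is a rough sketch of how an anchor-based decode in the style of YOLOv2 turns raw network outputs into a box. This is only an illustration: the function name `decode_anchor_box` and the anchor size used below are made up for the example, and the result is in grid-cell units. The point is that the anchor supplies a predefined shape that the network merely rescales, which is the "predefined" part that YOLOv1 does not have.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_anchor_box(t_x, t_y, t_w, t_h, col, row, anchor_w, anchor_h):
    """Anchor-style decode (hypothetical helper), in grid-cell units."""
    # The sigmoid keeps the predicted centre inside cell (col, row).
    b_x = col + sigmoid(t_x)
    b_y = row + sigmoid(t_y)
    # The anchor gives a prior width/height; the network only scales it.
    b_w = anchor_w * math.exp(t_w)
    b_h = anchor_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h

# With all raw outputs at zero, the box sits at the centre of the cell
# and has exactly the anchor's shape (anchor size here is an assumed value).
print(decode_anchor_box(0.0, 0.0, 0.0, 0.0, col=3, row=3,
                        anchor_w=1.5, anchor_h=2.0))
```

In YOLOv1 there is no `anchor_w` / `anchor_h` term at all; the width and height come straight out of the network.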

The key to understanding how the bounding boxes are formed is to first understand how the output is encoded. For that, I recommend this link: https://hackernoon.com/understanding-yolo-f5a74bbc7967

Briefly, and using the example from the paper: for S=7, B=2 and C=20, the output is a 7x7x30 tensor that encodes where the objects are (bounding box coordinates) and what they are (class probabilities). The 30 comes from B*5 + C: each of the 2 boxes has 5 values (x, y, w, h and a confidence score), plus 20 class probabilities per cell. To achieve this, we put a fully-connected layer at the end of the CNN whose output is reshaped to 7x7x30 (rather forcefully). There are no predefined boxes or offsets. On the first forward pass the weights are still at their random initial values, so each cell outputs 2 essentially random bounding boxes. A loss is calculated against the ground truth, and the weights of the CNN are then adjusted to reduce that loss (optimisation). The following passes will then produce bounding boxes closer to the ground truth.
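To make the encoding concrete, here is a minimal NumPy sketch of reading boxes out of such a 7x7x30 tensor. It is not the paper's code: the function name `decode_yolov1`, the per-cell memory layout (B blocks of x, y, w, h, confidence followed by C class scores) and the 448x448 image size in the usage lines are assumptions for illustration. It follows the paper's convention that x, y are relative to the grid cell and w, h are relative to the whole image.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (paper's example)

def decode_yolov1(output, img_w, img_h):
    """Turn an S x S x (B*5 + C) tensor into pixel-space boxes.

    Assumed layout per cell: B blocks of (x, y, w, h, confidence),
    then C class scores.
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            best_class = int(np.argmax(cell[B * 5:]))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # x, y: position of the box centre inside this cell (0..1)
                cx = (col + x) / S * img_w
                cy = (row + y) / S * img_h
                # w, h: box size as a fraction of the whole image
                bw = w * img_w
                bh = h * img_h
                boxes.append((cx, cy, bw, bh, conf, best_class))
    return boxes

# Stand-in for an untrained network: random values, hence random boxes.
rng = np.random.default_rng(0)
untrained_output = rng.random((S, S, B * 5 + C))
boxes = decode_yolov1(untrained_output, img_w=448, img_h=448)
print(len(boxes))  # S * S * B boxes in total
```

Nothing in the decode fixes where a box goes except the cell it belongs to. The numbers themselves come from the network, which is why they start out meaningless and only become useful as training reduces the loss.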

Baymax Lim
1

I think Andrew Ng's explanation might help you get a better understanding of the algorithm. Scan through the playlist; it explains YOLO in a very simple way. Then perhaps read the paper again once you have watched the videos, to get a complete understanding of how things work.