You only look once
This detector is a little less precise (improved in v2), but it is really fast. This chapter explains how it works and gives reference working code in TensorFlow.
The idea of this detector is that you run the image through a CNN model and get the detections in a single pass. First the image is resized to 448x448, then it is fed to the network, and finally the output is filtered by a non-max suppression algorithm.
The tiny version is composed of 9 convolution layers with leaky ReLU activations. Observe that after maxpool6 the 448x448 input image becomes a 7x7 feature map.
The output of this model is a tensor of shape [batch size, 7, 7, 30]. In this tensor the following information is encoded for each cell:
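The 448 to 7 reduction can be checked with a little arithmetic. The sketch below assumes that each of the six max-pool layers halves the spatial resolution (stride 2) and that the convolutions preserve it; the helper name `feature_map_size` is hypothetical.

```python
def feature_map_size(input_size, num_pools, stride=2):
    """Spatial size after `num_pools` pooling layers.

    Assumption: every pooling layer divides the resolution by `stride`
    and the convolution layers keep the resolution unchanged.
    """
    size = input_size
    for _ in range(num_pools):
        size //= stride
    return size


# 448 / 2**6 = 7
grid = feature_map_size(448, 6)
```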
- 2 Box definitions: (consisting of: x,y,width,height,"is object" confidence)
- 20 class probabilities (only considered if the "is object" confidence is high)
In general the output tensor has shape S x S x (B*5 + C), where:
- S: Tensor spatial dimension (7 in this case)
- B: Number of bounding boxes per cell (each one is x,y,w,h,confidence)
- C: Number of classes
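As a quick check of how S, B and C combine into the 7x7x30 output, here is a minimal sketch; the function name `yolo_output_shape` is made up for illustration.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Per-image output shape: an S x S grid where every cell stores
    B boxes of 5 values (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


shape = yolo_output_shape()
# Total number of values predicted per image
num_values = shape[0] * shape[1] * shape[2]
```

With the defaults used in this chapter the depth is 2*5 + 20 = 30, so the network predicts 7*7*30 = 1470 values per image.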
Here "is object" is the probability that a box contains any object (as opposed to background). During training, if a particular cell does not lie over any object, we set its "is object" target to zero.
What this 7x7 tensor represents
This 7x7 tensor can be considered as a 7x7 grid representing the input image, where each cell of this tensor will hold the 2 box definitions and 20 class probabilities.
It is also useful to note that each cell holds the probabilities of belonging to each of the 20 classes, and that each cell has 2 bounding boxes.
Together with the fact that each bounding box carries an "is object" confidence indicating whether it lies over an object or not, this information is what allows the class of the object to be detected.
The logic is that if there is an object in a cell, we decide which object it is by taking the largest class probability in that cell.
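The per-cell logic described above can be sketched as follows. The memory layout used here (2 boxes of x, y, w, h, confidence followed by 20 class probabilities) is an assumption made for illustration, since real implementations order these values differently. The prediction values and the helper `decode_cell` are hypothetical too. Multiplying the box confidence by the class probability gives a class-specific score, which is how the original YOLO paper combines the two.

```python
import numpy as np

S, B, C = 7, 2, 20


def decode_cell(cell):
    """Decode one grid cell (vector of length B*5 + C).

    Assumed layout: [x, y, w, h, conf] * B followed by C class probabilities.
    Returns the most confident box, the most likely class and their joint score.
    """
    boxes = cell[:B * 5].reshape(B, 5)      # one row per box
    class_probs = cell[B * 5:]              # shared by all boxes of the cell
    best_box = int(np.argmax(boxes[:, 4]))  # box with highest "is object"
    best_class = int(np.argmax(class_probs))
    score = float(boxes[best_box, 4] * class_probs[best_class])
    return boxes[best_box, :4], best_class, score


# Made-up prediction: only cell (3, 4) contains something
pred = np.zeros((S, S, B * 5 + C))
pred[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]   # box 0, confident
pred[3, 4, 5:10] = [0.4, 0.6, 0.1, 0.1, 0.2]  # box 1, less confident
pred[3, 4, B * 5 + 11] = 0.8                  # class 11 is the most likely

box, cls, score = decode_cell(pred[3, 4])
```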
At the end of the model at prediction time you will have something like this:
Finally by using thresholding and non-maxima suppression we can filter out boxes that are not valid detections.
Find which cell is nearest to the center of the ground-truth bounding box. (Matching phase)
For that cell, check which of its bounding boxes overlaps most with the ground truth (IoU), then decrease the confidence of the bounding box that overlaps less. (Each bounding box has its own confidence.)
Decrease the confidence of all bounding boxes in every cell that contains no object. Do not adjust the box coordinates or class probabilities of those cells.
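The matching phase can be sketched like this: the cell that contains the center of the ground-truth box becomes responsible for it. The box format (corner points in pixels) and the helper name `responsible_cell` are assumptions chosen for illustration.

```python
def responsible_cell(box, image_size=448, S=7):
    """Return (row, col) of the grid cell containing the center of `box`.

    box: (x_min, y_min, x_max, y_max) in pixels of the resized image.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    cell_size = image_size / S  # 64 pixels per cell for 448 / 7
    # Clamp so a center lying exactly on the right/bottom edge stays in the grid
    col = min(int(cx // cell_size), S - 1)
    row = min(int(cy // cell_size), S - 1)
    return row, col


# Center is (200, 300): column 200 // 64 = 3, row 300 // 64 = 4
row, col = responsible_cell((100, 200, 300, 400))
```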
The paper mentions that before training for object detection, they modified the network (adding average-pooling, FC and softmax layers) and trained it for classification on the ImageNet dataset for about one week (until they got a good top-5 error). Later they added more conv layers and the FC layers responsible for detection.
- Pre-trained on ImageNet
- Use lots of augmentation
- Use SGD to train
- Evaluated on Pascal VOC
- 135 Epochs, batch size: 64
- Momentum 0.9
- Random scale and translations up to 20% size of original image
- Color exposure/saturation augmentation
The network is trained with a multi-part loss function that takes the following objectives into account:
- Classification (20 classes)
- Object/No object classification
- Bounding box coordinates (x,y,height,width) regression (4 scalars)
Each of these sub-objectives uses a sum-squared error. In addition, the factors $\lambda_{coord}$ and $\lambda_{noobj}$ are used to weight the box-coordinate and object/no-object confidence terms differently.
Some other points to observe:
- The classification loss is not backpropagated if the cell has no object
- Only the bounding box with the highest IoU (intersection over union) with the ground truth has its box loss backpropagated
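Notation used in the loss function:
- $B$: Number of bounding boxes per cell (2)
- $x, y, w, h$: Box definition
- $p_i(c)$: Probability of some particular class $c$ in cell $i$
- $S$: Grid size (7)
- $\mathbb{1}_{i}^{obj}$: 1 if an object appears in cell $i$, otherwise 0
- $\mathbb{1}_{ij}^{obj}$: 1 if bounding box $j$ of cell $i$ is responsible for the prediction, otherwise 0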
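For reference, the multi-part loss can be written out as in the original YOLO paper. Hatted quantities are the ground-truth targets, $\mathbb{1}_{ij}^{noobj}$ is the complement of $\mathbb{1}_{ij}^{obj}$, and $C_i$ here denotes the "is object" confidence of a box (not the number of classes). The paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
    \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
& + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```

The first two lines are the box-coordinate regression, the next two are the object/no-object confidence terms, and the last line is the classification term, which only contributes for cells that contain an object. The square roots on width and height make an error of a given size count more for small boxes than for large ones.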
Intersection over Union (IoU)
IoU is a method used to evaluate how well an object detection output matches some ground truth. It is normally used during training and testing, by comparing how much the predicted bounding box overlaps with the ground-truth (training/test data) bounding box.
Calculating the IoU is simple: we divide the area of overlap between the boxes by the area of their union.
    # Calculate the intersection over union between boxes b1 and b2. Each box
    # is defined by 2 points: box(startX, startY, endX, endY). Other
    # definitions exist, e.g. box(x, y, width, height).
    def calc_iou(b1, b2):
        # determine the (x, y)-coordinates of the intersection rectangle
        xA = max(b1[0], b2[0])
        yA = max(b1[1], b2[1])
        xB = min(b1[2], b2[2])
        yB = min(b1[3], b2[3])

        # compute the area of the intersection rectangle
        # (clamped to zero when the boxes do not overlap)
        area_intersect = max(0, xB - xA + 1) * max(0, yB - yA + 1)

        # calculate the area of each box
        area_b1 = (b1[2] - b1[0] + 1) * (b1[3] - b1[1] + 1)
        area_b2 = (b2[2] - b2[0] + 1) * (b2[3] - b2[1] + 1)

        # compute the intersection over union by dividing the intersection
        # area by the sum of both box areas minus the intersection area
        iou = area_intersect / float(area_b1 + area_b2 - area_intersect)

        # return the intersection over union value
        return iou
Another way to calculate the IoU, with NumPy:
    import numpy as np

    def calc_iou(xy_min1, xy_max1, xy_min2, xy_max2):
        # Get areas
        areas_1 = np.multiply.reduce(xy_max1 - xy_min1)
        areas_2 = np.multiply.reduce(xy_max2 - xy_min2)

        # determine the (x, y)-coordinates of the intersection rectangle
        _xy_min = np.maximum(xy_min1, xy_min2)
        _xy_max = np.minimum(xy_max1, xy_max2)
        _wh = np.maximum(_xy_max - _xy_min, 0)

        # compute the area of intersection rectangle
        _areas = np.multiply.reduce(_wh)

        # return the intersection over union value
        return _areas / np.maximum(areas_1 + areas_2 - _areas, 1e-10)
Non-Maxima Suppression (nms)
At prediction time (after training) you may have lots of box predictions around a single object. The NMS algorithm filters out boxes that overlap each other by more than some IoU threshold, keeping the one with the highest confidence.
Here is an example with NumPy and Python:
    def non_max_suppress(conf, xy_min, xy_max, threshold=.4):
        _, _, classes = conf.shape
        # List comprehension
        # https://www.youtube.com/watch?v=HobjHIpLhZk
        # https://www.youtube.com/watch?v=Q7EYKuZJfdA
        # Each entry is a tuple (class confidences, xy_min, xy_max)
        boxes = [(_conf, _xy_min, _xy_max) for _conf, _xy_min, _xy_max in
                 zip(conf.reshape(-1, classes),
                     xy_min.reshape(-1, 2),
                     xy_max.reshape(-1, 2))]

        # Iterate each class
        for c in range(classes):
            # Sort boxes by their confidence for class c, highest first
            boxes.sort(key=lambda box: box[0][c], reverse=True)
            # Iterate each box
            for i in range(len(boxes) - 1):
                box = boxes[i]
                if box[0][c] == 0:
                    continue
                for _box in boxes[i + 1:]:
                    # Take iou threshold into account: suppress the less
                    # confident box when it overlaps too much
                    if calc_iou(box[1], box[2], _box[1], _box[2]) >= threshold:
                        _box[0][c] = 0
        return boxes
The YOLO detector has since been improved (v2). Its main improvements are:
- More accurate (73.4 mAP, the mean average precision over all classes, on the Pascal dataset)
- Can detect up to 9000 classes (previously 20)
What they did to improve:
- Added batch normalization
- Pre-train on ImageNet at multiple scales (first 224x224, then 448x448), and only afterwards train for detection.
- They now use anchor boxes like Faster R-CNN; classification is done per box shape instead of per grid cell.
- Instead of choosing the box shapes manually, they use K-means to derive box shapes from the data.
- Train the network at multiple scales; since the network is now fully convolutional (no FC layer), this is easy to do.
- They train on both ImageNet and MS-COCO.
- They created a new mechanism to train on datasets that don't have detection data, by selecting which parts of the multi-part loss function to backpropagate.
- Use WordTree to combine data from various sources, together with a joint optimization technique, to train simultaneously on ImageNet and COCO.
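The K-means step for picking anchor box shapes, mentioned in the list above, can be sketched as follows. This is a minimal illustration rather than the authors' code: it clusters (width, height) pairs using 1 - IoU as the distance, which is the metric described in the YOLOv2 paper. The function names `wh_iou` and `kmeans_anchors` and the sample data are made up.

```python
import numpy as np


def wh_iou(wh, centroids):
    """IoU between boxes and centroids that share the same top-left corner.

    wh: (N, 2) widths/heights, centroids: (K, 2). Returns an (N, K) matrix.
    """
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
             np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    area_wh = wh[:, 0] * wh[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_wh[:, None] + area_c[None, :] - inter)


def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster box shapes with K-means using distance = 1 - IoU."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct boxes picked at random
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Smallest (1 - IoU) distance is the same as largest IoU
        assign = np.argmax(wh_iou(wh, centroids), axis=1)
        new = np.array([wh[assign == j].mean(axis=0)
                        if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Return anchors sorted by area, smallest first
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]


# Made-up ground-truth box shapes: three small boxes and three large ones
shapes = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
          [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]
anchors = kmeans_anchors(shapes, k=2)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the error no longer grows with box size.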