Object Detection (YOLO): PyTorch Implementation (Part 1)

Updated: 2024-10-26 08:21:11


Object detection basics

What is object detection?

Finding and localizing every object of interest in an image.

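How a bounding box is specified

A bounding box is a rectangle, so four values are enough to pin down its position in the image. Two conventions are common, referred to here as corners and midpoint:

corners: the coordinates of the top-left and bottom-right vertices, $(x_1, y_1, x_2, y_2)$
midpoint: the coordinates of the box center plus the box width and height, $(x, y, w, h)$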
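The two conventions carry the same information, so converting between them is simple arithmetic. The sketch below is a minimal, dependency-free illustration; the function names `midpoint_to_corners` and `corners_to_midpoint` are hypothetical and do not appear elsewhere in this article.

```python
def midpoint_to_corners(box):
    """Convert (x, y, w, h), with (x, y) the box center, to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def corners_to_midpoint(box):
    """Convert (x1, y1, x2, y2) to (x, y, w, h), with (x, y) the box center."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corners = midpoint_to_corners((0.5, 0.5, 0.5, 0.25))
print(corners)                       # (0.25, 0.375, 0.75, 0.625)
print(corners_to_midpoint(corners))  # (0.5, 0.5, 0.5, 0.25)
```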

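How do we measure the gap between a predicted box and the ground-truth box?

The usual measure is IOU (intersection over union), $\text{IOU} = \frac{A \cap B}{A \cup B}$, where $A$ is the ground-truth box region and $B$ is the predicted box region.
How do we compute the intersection of two boxes?
The key observation is that the intersection of two boxes is itself a box, so it is fully determined by its top-left and bottom-right coordinates.

In the image coordinate system, the top-left corner is $(0, 0)$ and the bottom-right corner is $(1, 1)$.
Suppose the ground-truth box has top-left corner $(x_1, y_1)$ and the predicted box has top-left corner $(x_2, y_2)$. The top-left corner of their intersection is then $(\max(x_1, x_2), \max(y_1, y_2))$; sketching it on paper makes this easy to see. Similarly, if the bottom-right corners of the two boxes are $(x_3, y_3)$ and $(x_4, y_4)$, the bottom-right corner of the intersection is $(\min(x_3, x_4), \min(y_3, y_4))$. When the boxes do not overlap, the resulting width or height is negative and the intersection area is taken to be 0.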

import torch


def intersection_over_union(predict, target, box_format='midpoint'):
    """
    predict, target: tensors of shape [..., 4]
    box_format: 'corners' (x1, y1, x2, y2) or 'midpoint' (x, y, w, h)
    Returns a tensor of shape [..., 1] with the IOU of each box pair.
    """
    if box_format == 'midpoint':
        # Slices such as 0:1 (instead of the plain index 0) keep the last dimension.
        # Predicted box: center minus/plus half the width or height gives the corners.
        box1_x1 = predict[..., 0:1] - predict[..., 2:3] / 2
        box1_y1 = predict[..., 1:2] - predict[..., 3:4] / 2
        box1_x2 = predict[..., 0:1] + predict[..., 2:3] / 2
        box1_y2 = predict[..., 1:2] + predict[..., 3:4] / 2
        # Ground-truth box
        box2_x1 = target[..., 0:1] - target[..., 2:3] / 2
        box2_y1 = target[..., 1:2] - target[..., 3:4] / 2
        box2_x2 = target[..., 0:1] + target[..., 2:3] / 2
        box2_y2 = target[..., 1:2] + target[..., 3:4] / 2
    elif box_format == 'corners':
        # In corners format the corner coordinates can be read off directly.
        box1_x1 = predict[..., 0:1]
        box1_y1 = predict[..., 1:2]
        box1_x2 = predict[..., 2:3]
        box1_y2 = predict[..., 3:4]
        box2_x1 = target[..., 0:1]
        box2_y1 = target[..., 1:2]
        box2_x2 = target[..., 2:3]
        box2_y2 = target[..., 3:4]
    else:
        raise ValueError("box_format must be 'midpoint' or 'corners'")

    # Corners of the intersection: max of the top-left corners, min of the bottom-right corners
    x1 = torch.max(box1_x1, box2_x1)
    y1 = torch.max(box1_y1, box2_y1)
    x2 = torch.min(box1_x2, box2_x2)
    y2 = torch.min(box1_y2, box2_y2)

    # Intersection area (width * height); clamp(0) handles boxes that do not overlap
    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)

    # Areas of the predicted and ground-truth boxes, then the union
    box1_area = torch.abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
    box2_area = torch.abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))
    union = box1_area + box2_area - intersection

    return intersection / (union + 1e-6)  # the small epsilon avoids division by zero
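To check the max/min reasoning by hand, here is a small dependency-free sketch for a single pair of boxes in corners format. The helper name `iou_corners` and the example boxes are made up for illustration; it uses plain floats rather than tensors.

```python
def iou_corners(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection box: max of the top-left corners, min of the bottom-right corners
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    # A negative width or height means the boxes do not overlap
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)


truth = (0.0, 0.0, 0.5, 0.5)
pred = (0.25, 0.25, 0.75, 0.75)
# Intersection is 0.25 * 0.25 = 0.0625, union is 0.25 + 0.25 - 0.0625 = 0.4375
print(iou_corners(truth, pred))                                      # 1/7, about 0.1429
print(iou_corners((0.0, 0.0, 0.25, 0.25), (0.5, 0.5, 0.75, 0.75)))   # 0.0, no overlap
```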

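Non-max suppression (NMS)

When several boxes detect the same object, we only want to keep the best one. NMS is the algorithm that removes the redundant boxes.
In NMS, each box is stored as [class, confidence, x, y, w, h]. class is the category the box predicts, confidence is how confident the model is that the box contains an object, and the last four values describe the box in midpoint format.
The NMS algorithm:
Input: boxes, a list of boxes in the format above
Output: boxes_after_nms, the boxes that survive non-max suppression (poor boxes have been removed)
Procedure:

1. Go through boxes and delete every box whose confidence is below the threshold
2. Sort the remaining boxes by confidence in descending order
3. Take the box with the highest confidence out of boxes as chosen_box, then go through the remaining boxes; any box that predicts the same class as chosen_box and whose IOU with it exceeds a threshold is doing duplicate work and is deleted
4. Append chosen_box to the result list boxes_after_nms
5. Repeat steps 3 and 4 until boxes is empty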

def non_max_suppression(boxes, threshold, iou_threshold, box_format='midpoint'):
    """
    boxes: list of boxes, each [class, confidence, x, y, width, height]
    threshold: confidence threshold; boxes below it are treated as empty and removed
    iou_threshold: IOU threshold between chosen_box and the other boxes
    box_format: 'corners' or 'midpoint'
    """
    # Drop every box below the confidence threshold
    boxes = [box for box in boxes if box[1] > threshold]
    boxes_after_nms = []
    # Sort by confidence, highest first
    boxes = sorted(boxes, key=lambda x: x[1], reverse=True)

    while boxes:
        # The most confident remaining box is assumed to be the best detection
        chosen_box = boxes.pop(0)
        # Keep a box if it predicts another class or does not overlap chosen_box too much
        boxes = [
            box for box in boxes
            if chosen_box[0] != box[0]
            or intersection_over_union(
                torch.tensor(box[2:]),
                torch.tensor(chosen_box[2:]),
                box_format=box_format,
            ) < iou_threshold
        ]
        boxes_after_nms.append(chosen_box)

    return boxes_after_nms
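The five steps can be traced on a toy input. The following is a dependency-free sketch of the same procedure using plain Python floats; `iou_midpoint`, `nms`, and the five example boxes are hypothetical names and values chosen only to exercise each rule.

```python
def iou_midpoint(a, b):
    """IOU of two (x, y, w, h) boxes, where (x, y) is the box center."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter = max(min(ax2, bx2) - max(ax1, bx1), 0.0) * max(min(ay2, by2) - max(ay1, by1), 0.0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def nms(boxes, threshold, iou_threshold):
    # Steps 1 and 2: filter by confidence, then sort by confidence (descending)
    boxes = sorted((b for b in boxes if b[1] > threshold), key=lambda b: b[1], reverse=True)
    kept = []
    while boxes:                      # step 5: repeat until nothing is left
        chosen = boxes.pop(0)         # step 3: take the most confident box
        boxes = [b for b in boxes
                 if b[0] != chosen[0] or iou_midpoint(b[2:], chosen[2:]) < iou_threshold]
        kept.append(chosen)           # step 4: keep it
    return kept


boxes = [
    [0, 0.9, 0.50, 0.5, 0.4, 0.4],  # best box for class 0
    [0, 0.8, 0.52, 0.5, 0.4, 0.4],  # near-duplicate of the first box (IOU about 0.9)
    [0, 0.7, 0.10, 0.1, 0.1, 0.1],  # same class, different location
    [1, 0.6, 0.50, 0.5, 0.4, 0.4],  # same location, different class
    [0, 0.1, 0.50, 0.5, 0.4, 0.4],  # below the confidence threshold
]
kept = nms(boxes, threshold=0.2, iou_threshold=0.5)
print([b[1] for b in kept])  # [0.9, 0.7, 0.6]
```

Only the near-duplicate and the low-confidence box are removed; the box of a different class survives even though it overlaps the chosen box completely.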

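Mean Average Precision

mAP is an important metric for judging the quality of object detection. Before computing it, we need the concepts of precision and recall.
Suppose we predict a box. If its IOU with a ground-truth box exceeds the threshold, we consider it a correct prediction and call it a True Positive (TP). If the IOU is below the threshold, the prediction is wrong: the model reported a detection, but it does not match a real object. Such a prediction is called a False Positive (FP).
Finally, let Total be the number of ground-truth boxes. With these three quantities we can define precision and recall:
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall} = \frac{TP}{Total}$
Going through the predictions one by one in descending order of confidence, each prediction yields a new (precision, recall) pair. Plot recall on the horizontal axis and precision on the vertical axis, connect the points in order, and the area under the resulting curve is the Average Precision (AP).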

Precision	Recall
1/1	1/4
1/2	1/4
2/3	2/4
2/4	2/4
3/5	3/4
3/6	3/4
3/7	3/4
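The table can be reproduced with a few lines of arithmetic. The sketch below assumes the seven detections were, in order, TP, FP, TP, FP, TP, FP, FP with 4 ground-truth boxes in total; that sequence is inferred from the table rather than stated in the text. It also assumes the curve starts at the point (recall 0, precision 1), which is the convention used by the implementation later in this article, and integrates with the trapezoid rule.

```python
from fractions import Fraction

is_tp = [1, 0, 1, 0, 1, 0, 0]   # assumed outcome of each detection, best confidence first
total_ground_truth = 4

tp_seen = 0
precisions = [Fraction(1)]      # starting point of the curve: precision 1 ...
recalls = [Fraction(0)]         # ... at recall 0
for rank, hit in enumerate(is_tp, start=1):
    tp_seen += hit
    precisions.append(Fraction(tp_seen, rank))               # TP / (TP + FP)
    recalls.append(Fraction(tp_seen, total_ground_truth))    # TP / Total

# Trapezoid rule: sum of width * average height between consecutive points
ap = sum(
    (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    for i in range(1, len(recalls))
)
print(ap, float(ap))  # 8/15, about 0.533
```

Exact fractions are used only to make the result easy to verify by hand: the three non-zero trapezoids contribute 1/4, 7/48 and 11/80.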

Every class has its own Average Precision. Summing the AP of all classes and dividing by the number of classes gives the mAP (mean average precision).

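Implementation
Input: the predicted boxes and the ground-truth labels of a batch, each in the form $[index, class, confidence, x, y, w, h]$, where index identifies the image the box belongs to.
Output: the mean average precision of the batch

Procedure:

For each class, compute its average precision: collect the entries of that class from boxes and labels into class_boxes and class_labels
The length of class_boxes is the number of detections; use a tensor to record whether each detection is a TP
Group class_boxes and class_labels by index
While iterating over class_boxes, if the image a box belongs to has no ground-truth box (class_labels has no entry for that index), the detection is marked as a negative
If class_labels does have an entry for that index, process the detections in descending order of confidence, compute the IOU against every ground-truth box of that image, and take the largest IOU. If it exceeds the threshold, the detection is a positive; otherwise it is a negative. (Several predicted boxes may overlap the same ground-truth box strongly; only one of them counts as a positive and the rest are negatives.)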

from collections import Counter

import torch


def mean_average_precision(predict, target, num_classes, iou_threshold=0.5, box_format='midpoint'):
    """
    predict, target: lists of [index, class, confidence, x, y, width, height]
    """
    # AP of each class
    average_precision = []

    for c in range(num_classes):
        # Collect the detections and the ground-truth boxes of class c
        class_boxes = [box for box in predict if box[1] == c]
        class_labels = [label for label in target if label[1] == c]

        total_ground_truth = len(class_labels)
        if total_ground_truth == 0:
            continue

        # Whether each detection is a true positive or a false positive
        TP = torch.zeros(len(class_boxes))
        FP = torch.zeros(len(class_boxes))

        # For every image index, one flag per ground-truth box: 1 once it has been matched
        amounted_labels = Counter([label[0] for label in class_labels])
        for key, value in amounted_labels.items():
            amounted_labels[key] = torch.zeros(value)

        # Sort the detections by confidence, highest first
        class_boxes = sorted(class_boxes, key=lambda x: x[2], reverse=True)

        for detection_step, box in enumerate(class_boxes):
            ground_truth = [label for label in class_labels if label[0] == box[0]]
            # No ground-truth box in this image: the detection is a false positive
            if len(ground_truth) == 0:
                FP[detection_step] = 1
                continue

            best_iou = 0
            best_index = 0
            for idx, gt in enumerate(ground_truth):
                iou = intersection_over_union(torch.tensor(box[3:]), torch.tensor(gt[3:]), box_format)
                if iou > best_iou:
                    best_iou = iou
                    best_index = idx

            # A ground-truth box can be matched only once
            if best_iou > iou_threshold and amounted_labels[box[0]][best_index] == 0:
                amounted_labels[box[0]][best_index] = 1
                TP[detection_step] = 1
            else:
                FP[detection_step] = 1

        TP_cumsum = torch.cumsum(TP, dim=0)
        FP_cumsum = torch.cumsum(FP, dim=0)
        recall = TP_cumsum / total_ground_truth
        recall = torch.cat([torch.tensor([0.0]), recall])        # curve starts at recall 0
        precision = TP_cumsum / (TP_cumsum + FP_cumsum)
        precision = torch.cat([torch.tensor([1.0]), precision])  # ... with precision 1
        # Area under the precision-recall curve
        average_precision.append(torch.trapz(precision, recall))

    return sum(average_precision) / len(average_precision)

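Key points of the YOLOv1 paper

The algorithms that came before YOLO treated object detection as a two-stage task. YOLO reframes it as a single regression problem: features are extracted directly from the image, and the network outputs box coordinates and class probabilities.
Framing detection as regression gives YOLO the following benefits.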
First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems.

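Because YOLO is a one-stage regression model, detection is very fast, which makes it well suited to real-time detection, and it reaches a higher mAP than other real-time systems.

Second, YOLO reasons globally about the image when making predictions.
It uses information from the whole image for detection, unlike approaches such as sliding windows, which cannot see the global context.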

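The model divides the input image into an $S \times S$ grid. The paper uses $S = 7$, so an image yields 49 cells. If the center of an object falls into a cell, that cell is responsible for detecting the object. In other words, a cell can only predict one class of object: if several objects fall into the same cell, it still predicts only one.
Each grid cell predicts $B$ bounding boxes and a confidence for each; the paper uses $B = 2$. The confidence score reflects how likely it is that the box contains an object.
$\text{confidence} = \Pr(\text{obj}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$. If no object is in the box, the confidence should be 0; otherwise it should equal the IOU.
Each predicted bounding box consists of five values $(x, y, w, h, \text{confidence})$. $x, y$ are relative to the cell, $w, h$ are relative to the whole image, and confidence represents the IOU between the predicted box and the ground-truth box.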
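To make "relative to the cell" concrete, the sketch below maps a box given in image-relative midpoint format to its responsible grid cell and to cell-relative center coordinates. This encoding is an assumption for illustration (the article does not show how labels are built), and `encode_box` is a hypothetical helper name.

```python
def encode_box(x, y, w, h, S=7):
    """
    x, y, w, h: box in midpoint format, all relative to the whole image (0..1).
    Returns (row, col, x_cell, y_cell, w, h): the responsible cell, the center
    measured from that cell's top-left corner in units of one cell, and the
    unchanged image-relative width and height.
    """
    col = min(int(x * S), S - 1)   # min() keeps x == 1.0 inside the last column
    row = min(int(y * S), S - 1)
    x_cell = x * S - col
    y_cell = y * S - row
    return row, col, x_cell, y_cell, w, h


# Center at (0.5, 0.25): 0.5 * 7 = 3.5 -> column 3, offset 0.5
#                        0.25 * 7 = 1.75 -> row 1, offset 0.75
print(encode_box(0.5, 0.25, 0.2, 0.4))  # (1, 3, 0.5, 0.75, 0.2, 0.4)
```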

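The network's predictions have two parts: one part predicts the bounding boxes (the upper middle image of Figure 2 in the paper), the other predicts class probabilities (the lower image).

YOLO uses the model below: 24 convolutional layers extract features from the whole image, and 2 fully connected layers carry out the detection.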

"""
Implementation of Yolo (v1) architecture
with slight modification with added BatchNorm.
"""
import torch
import torch.nn as nn
""" 
Information about architecture config:
Tuple is structured by (kernel_size, filters, stride, padding) 
"M" is simply maxpooling with stride 2x2 and kernel 2x2
List is structured by tuples and lastly int with number of repeats
"""
architecture_config = [
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]


class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))


class Yolov1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(Yolov1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)

    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))

    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if type(x) == tuple:
                # A single convolution: (kernel_size, filters, stride, padding)
                layers += [
                    CNNBlock(in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3])
                ]
                in_channels = x[1]
            elif type(x) == str:
                # "M": 2x2 max pooling with stride 2
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]
            elif type(x) == list:
                # Two convolutions repeated num_repeats times
                conv1 = x[0]
                conv2 = x[1]
                num_repeats = x[2]
                for _ in range(num_repeats):
                    layers += [
                        CNNBlock(
                            in_channels,
                            conv1[1],
                            kernel_size=conv1[0],
                            stride=conv1[2],
                            padding=conv1[3],
                        )
                    ]
                    layers += [
                        CNNBlock(
                            conv1[1],
                            conv2[1],
                            kernel_size=conv2[0],
                            stride=conv2[2],
                            padding=conv2[3],
                        )
                    ]
                    in_channels = conv2[1]

        return nn.Sequential(*layers)

    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes
        # In original paper this should be
        # nn.Linear(1024*S*S, 4096),
        # nn.LeakyReLU(0.1),
        # nn.Linear(4096, S*S*(B*5+C))
        return nn.Sequential(
            nn.Linear(1024 * S * S, 496),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (C + B * 5)),
        )
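A quick way to sanity-check the configuration without building the network is to count the convolutions and trace the spatial size through it. The sketch below is dependency-free; it copies the configuration, and it assumes a 448 x 448 input (the input resolution used in the YOLOv1 paper, not stated in the code above). `conv_out` and `trace` are hypothetical helper names.

```python
architecture_config = [
    (7, 64, 2, 3), "M",
    (3, 192, 1, 1), "M",
    (1, 128, 1, 0), (3, 256, 1, 1), (1, 256, 1, 0), (3, 512, 1, 1), "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), (3, 1024, 1, 1), "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1), (3, 1024, 2, 1), (3, 1024, 1, 1), (3, 1024, 1, 1),
]


def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(config, size):
    """Return (number of conv layers, final spatial size, final channel count)."""
    num_convs, channels = 0, None
    for item in config:
        if isinstance(item, tuple):
            blocks = [item]
        elif isinstance(item, list):
            *convs, repeats = item
            blocks = convs * repeats          # the conv pair repeated `repeats` times
        else:                                 # "M": 2x2 max pooling, stride 2
            size = conv_out(size, kernel=2, stride=2, padding=0)
            continue
        for kernel, filters, stride, padding in blocks:
            size = conv_out(size, kernel, stride, padding)
            channels = filters
            num_convs += 1
    return num_convs, size, channels


print(trace(architecture_config, size=448))  # (24, 7, 1024)
```

The result matches the text: 24 convolutional layers, ending in a 7 x 7 feature map with 1024 channels, which is why the first fully connected layer takes `1024 * S * S` inputs with S = 7.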

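That is the model implementation: nothing more than a stack of convolutional layers followed by fully connected layers. What matters is the shape of the output tensor, (batch_size, S * S * (C + B * 5)), which is (batch_size, 7 * 7 * 30) = (batch_size, 1470) for S = 7, B = 2, C = 20.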
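The flat output is easier to reason about once it is split per cell. The sketch below uses a plain Python list as a stand-in for one row of the network output and assumes the per-cell layout used by the loss code later in this article: 20 class scores, then (confidence, x, y, w, h) for box 1, then the same for box 2. `cell_vector` is a hypothetical helper; the row-major order it assumes is the one `reshape(batch_size, S, S, -1)` produces.

```python
S, B, C = 7, 2, 20
per_cell = C + B * 5                      # 30 values per grid cell
flat = list(range(S * S * per_cell))      # stand-in for one output row; values are just positions
print(len(flat))                          # 1470


def cell_vector(flat, row, col):
    """Slice out the 30 values that belong to grid cell (row, col)."""
    start = (row * S + col) * per_cell
    return flat[start:start + per_cell]


vec = cell_vector(flat, row=0, col=1)
class_scores = vec[:C]                    # indices 0..19 within the cell
box1 = vec[C:C + 5]                       # (confidence, x, y, w, h) of box 1
box2 = vec[C + 5:C + 10]                  # (confidence, x, y, w, h) of box 2
print(box1, box2)                         # [50, 51, 52, 53, 54] [55, 56, 57, 58, 59]
```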

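The model is trained with a sum-squared error loss. The authors point out that weighting localization error equally with classification error is not ideal. In addition, most bounding boxes contain no object, so training pushes their confidence toward 0, and this can overpower the gradient coming from the boxes that do contain objects.
For these reasons the authors increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. The code below uses the weights 5 and 0.5 for these two terms.
Sum-squared error also treats large boxes and small boxes the same way, although a small deviation should matter more for a small box than for a large one. To account for this, the loss is computed on the square roots of the width and height.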
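A small numeric check shows what the square root buys. The widths below are made-up values: the same absolute error of 0.05 is applied to a large box (width 0.90) and a small box (width 0.10), and the squared error is computed with and without the square root.

```python
import math


def squared_error(pred, true, use_sqrt):
    """Squared error on the raw values or on their square roots."""
    if use_sqrt:
        pred, true = math.sqrt(pred), math.sqrt(true)
    return (pred - true) ** 2


large_raw = squared_error(0.95, 0.90, use_sqrt=False)
small_raw = squared_error(0.15, 0.10, use_sqrt=False)
large_sqrt = squared_error(0.95, 0.90, use_sqrt=True)
small_sqrt = squared_error(0.15, 0.10, use_sqrt=True)

print(large_raw, small_raw)    # both about 0.0025: the raw loss cannot tell the boxes apart
print(large_sqrt, small_sqrt)  # the small box is now penalized several times more
```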
Each grid cell may predict several bounding boxes. The one with the highest IOU against the ground truth is taken as the responsible box.
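For reference, the loss described above is written in the YOLOv1 paper as follows. Hatted symbols are the network's predictions, $C_i$ is the confidence, $p_i(c)$ is the class probability, and the paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the coordinate terms, the third and fourth are the confidence terms for boxes with and without an object, and the last line is the classification term.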

Here $\mathbb{1}_{ij}^{obj}$ selects, in each grid cell, the box with the highest IOU, that is, the responsible box.
The PyTorch code below implements this loss function.

import torch
from torch import nn
class YoLoLoss(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S = S
        self.B = B
        self.C = C
        self.lambda_coord = 5
        self.lambda_noobj = 0.5
        self.mse = nn.MSELoss(reduction='sum')

    def forward(self, preds, target, box_format='midpoint'):
        # preds: network output (N, S * S * (C + B * 5)); target: labels (N, S, S, C + B * 5)
        # Per-cell layout: [0:20] class scores, [20] confidence 1, [21:25] box 1,
        # [25] confidence 2, [26:30] box 2. The label keeps its box in [20:25].
        batch_size = preds.shape[0]
        preds = preds.reshape(batch_size, self.S, self.S, -1)

        # Step 1: find the responsible box (highest IOU) in each grid cell
        iou1 = intersection_over_union(preds[..., 21:25], target[..., 21:25], box_format)  # [N, S, S, 1]
        iou2 = intersection_over_union(preds[..., 26:30], target[..., 21:25], box_format)  # [N, S, S, 1]
        ious = torch.stack([iou1, iou2], dim=0)  # [2, N, S, S, 1]
        # Index (0 or 1) of the box with the larger IOU, [N, S, S, 1]
        _, responsible_boxes = torch.max(ious, dim=0)
        # Whether the cell contains an object: the label confidence is 1 if it does, 0 if not
        exist_boxes = target[..., 20].unsqueeze(3)  # [N, S, S, 1]

        # Coordinates (x, y, w, h) of the responsible box, [N, S, S, 4]
        box_coordinate = (1 - responsible_boxes) * preds[..., 21:25] + responsible_boxes * preds[..., 26:30]
        box_coordinate = exist_boxes * box_coordinate  # zero out cells without an object
        target_coordinate = exist_boxes * target[..., 21:25]

        # Square root of width and height; torch.sign keeps the sign of negative predictions.
        # New tensors are built so that neither preds nor target is modified in place.
        box_wh = torch.sign(box_coordinate[..., 2:4]) * torch.sqrt(torch.abs(box_coordinate[..., 2:4]) + 1e-6)
        box_coordinate = torch.cat([box_coordinate[..., 0:2], box_wh], dim=-1)
        target_coordinate = torch.cat(
            [target_coordinate[..., 0:2], torch.sqrt(target_coordinate[..., 2:4])], dim=-1
        )

        # Loss on x, y, w, h: the first and second terms of the loss in the paper
        coordinate_loss = self.mse(
            torch.flatten(box_coordinate, end_dim=-2),  # keep the last dimension
            torch.flatten(target_coordinate, end_dim=-2),
        )

        # Confidence of the responsible box; slices keep the last dimension
        box_confidence = (1 - responsible_boxes) * preds[..., 20:21] + responsible_boxes * preds[..., 25:26]
        # Confidence loss for cells with an object: the third term in the paper
        confidence_loss = self.mse(
            torch.flatten(exist_boxes * box_confidence, end_dim=-2),
            torch.flatten(exist_boxes * target[..., 20:21], end_dim=-2),
        )

        # No-object loss, applied to both predicted boxes
        noobj_loss = self.mse(
            torch.flatten((1 - exist_boxes) * preds[..., 20:21], end_dim=-2),
            torch.flatten((1 - exist_boxes) * target[..., 20:21], end_dim=-2),
        )
        noobj_loss += self.mse(
            torch.flatten((1 - exist_boxes) * preds[..., 25:26], end_dim=-2),
            torch.flatten((1 - exist_boxes) * target[..., 20:21], end_dim=-2),
        )

        # Classification loss, only for cells that contain an object
        class_loss = self.mse(
            torch.flatten(exist_boxes * preds[..., :20], end_dim=-2),
            torch.flatten(exist_boxes * target[..., :20], end_dim=-2),
        )

        loss = (
            self.lambda_coord * coordinate_loss
            + confidence_loss
            + self.lambda_noobj * noobj_loss
            + class_loss
        )
        return loss


Published: 2023-07-28 18:48:35

Article link: https://www.elefans.com/category/jswz/34/1278874.html