Autonomous Driving, BEV Detection Part 3: DETR-3D

Author: 猴君

Paper: DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Code: WangYueFt/detr3d (github.com)

1、Introduction

        DETR3D is one of the key 3D object detection algorithms and the pioneering work that extends DETR-style 2D detection into 3D space. It takes a different route from LSS, BEVDet, and other solutions that first estimate depth and then perform an explicit 2D-to-3D lifting.

        DETR3D first presets a set of learnable query vectors for the predicted boxes (object queries) and uses them to generate 3D reference points. These 3D reference points are projected back onto the 2D image planes with the camera projection matrices, and the image features at the projected locations are gathered. Cross-attention between these image features and the object queries then refines the object queries layer by layer. Finally, two MLP branches output the classification predictions and the regression predictions respectively. Positive and negative samples are assigned with the same bipartite matching as in DETR: based on the minimum matching cost, the N predictions (out of the 900 object queries) that best match the ground-truth boxes are selected as positives. Because both the sample assignment and the query-based detection paradigm mirror DETR, DETR3D can be viewed as the extension of DETR to 3D.

        Note: overall, DETR3D is also a transformer-style detection framework, but it has no encoder; a conventional convolutional backbone is used for feature extraction.

Figure 1

        Overall steps:

        (1) First, an image backbone such as ResNet-50 extracts features from the images captured by the cameras at the different viewpoints. (One might be tempted to regard this as the encoder of a transformer architecture, but it is actually quite different; it reflects the early style of combining transformers with vision tasks before ViT-style backbones became common, i.e. a CNN backbone feeding a transformer head.)

        (2) Use nn.Embedding to initialize the object query embeddings, then regress a 3D reference point c_i (the center of the i-th box) from each query with a fully connected layer (MLP).

        (3) Using the camera intrinsics and extrinsics, project the 3D reference points from step (2) (points in the world / LiDAR coordinate frame) onto the feature maps of the camera image planes; from this point on, the procedure follows the 2D DETR detection pipeline.

Figure 2

        (4) Since every view provides multi-scale features, bilinear interpolation is used to sample the feature maps at the projected locations, giving sampled features {F1, F2, F3, F4} at the different scales and avoiding resolution mismatches between them. (This can be understood as a cross-attention-like interaction between the projected 2D reference points and the multi-scale feature maps.)

        (5) Merge the sampled features from the different scales and add them to the original object query embeddings to refine the queries.

        (6) Iterate this refinement several times (once per decoder layer), and finally regress the classes and box locations from the last query embeddings (see the sketch below).
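        To make the flow of these six steps concrete, here is a toy, self-contained sketch of a single DETR3D-style refinement step (all sizes and names are illustrative, not the repo's actual identifiers; the real model repeats steps 3-5 inside six decoder layers):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_query, embed_dims, num_cam = 900, 256, 6

feats = [torch.randn(num_cam, embed_dims, 116, 200)]          # (1) stand-in for backbone + FPN output (one level only)
lidar2img = torch.eye(4).expand(num_cam, 4, 4)                # per-camera projection matrices (dummy identities)

query_embedding = nn.Embedding(num_query, embed_dims * 2)     # (2) learnable object queries
query_pos, query = torch.split(query_embedding.weight, embed_dims, dim=1)
ref_point_mlp = nn.Linear(embed_dims, 3)
ref_points = ref_point_mlp(query_pos).sigmoid()               # (2) 3D reference points, normalized to [0,1]

# (3) project the reference points into each camera: homogeneous coordinates, then perspective divide
pts = torch.cat([ref_points, torch.ones(num_query, 1)], dim=-1)       # [900,4]
cam_pts = torch.einsum('nij,qj->nqi', lidar2img, pts)                 # [6,900,4]
uv = cam_pts[..., :2] / cam_pts[..., 2:3].clamp(min=1e-5)             # [6,900,2]
uv = uv.clamp(-1, 1).view(num_cam, num_query, 1, 2)                   # normalized coords for grid_sample

# (4) bilinear sampling of the image features at the projected locations
sampled = F.grid_sample(feats[0], uv, align_corners=False)            # [6,256,900,1]

# (5) fuse the per-camera samples and refine the queries (the real model does this in 6 decoder layers)
query = query + sampled.mean(dim=0).squeeze(-1).t()                   # [900,256]

# (6) predict class logits and box parameters from the refined queries
cls_branch, reg_branch = nn.Linear(embed_dims, 10), nn.Linear(embed_dims, 10)
print(cls_branch(query).shape, reg_branch(query).shape)               # torch.Size([900, 10]) twice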

2、pipeline

2.1 img_backbone + grid_mask + img_neck

2.1.1 Principle

        The img_backbone here is ResNet-50 (a very common image feature extractor), and grid_mask is a common data-augmentation scheme; if interested, you can refer to my BEVFormer post.

        The resulting multi-scale img_feats is a list containing feature maps of different sizes: [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25].

2.1.2 Code

        The extract_img_feat function:

    def extract_img_feat(self, img, img_metas):
        """Extract features of images."""
        B = img.size(0)
        if img is not None:
            input_shape = img.shape[-2:]
            # update real input shape of each single img
            for img_meta in img_metas:
                img_meta.update(input_shape=input_shape)

            if img.dim() == 5 and img.size(0) == 1:
                img.squeeze_()
            elif img.dim() == 5 and img.size(0) > 1:
                B, N, C, H, W = img.size()
                img = img.view(B * N, C, H, W)
            if self.use_grid_mask:
                img = self.grid_mask(img)
            img_feats = self.img_backbone(img)
            if isinstance(img_feats, dict):
                img_feats = list(img_feats.values())
        else:
            return None
        if self.with_img_neck:
            img_feats = self.img_neck(img_feats)
        img_feats_reshaped = []
        for img_feat in img_feats:
            BN, C, H, W = img_feat.size()
            img_feats_reshaped.append(img_feat.view(B, int(BN / B), C, H, W))
        return img_feats_reshaped

2.2 Decoder

2.2.1 Query and reference_points initialization

2.2.1.1 Principle

        The query_embeds here are a set of learnable features initialized by nn.Embedding, with shape [900,512]. torch.split divides them into query and query_pos, i.e. the query itself and its positional encoding.

        An MLP (a single linear layer) then regresses the [900,256] positional part of the query (query_pos in the code below) into a 3D center coordinate reference_point of shape [900,3], and a sigmoid maps it into the range 0-1.

        Note: how should the 900 be interpreted? My understanding is that it is num_query, matching the number of reference_points generated afterwards; in effect, 900 candidate bounding boxes are predicted, each encoded by a 256-dimensional vector.

# self.query_embedding = nn.Embedding(self.num_query, self.embed_dims * 2)
query_embeds = self.query_embedding.weight      # [900,512]
2.2.1.2 Code
    def forward(self,
                mlvl_feats,         # [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
                query_embed,        # [900,512]
                reg_branches=None,  # 6 fully connected regression branches
                **kwargs):
        """Forward function for `Detr3DTransformer`.
        Args:
            mlvl_feats (list(Tensor)): Input features from different levels,
                each with shape [bs, embed_dims, h, w].
            query_embed (Tensor): The query embedding for the decoder,
                with shape [num_query, c].
            reg_branches (obj:`nn.ModuleList`): Regression heads for the outputs
                of each decoder layer. Only passed when `with_box_refine` is True.
                Default to None.
        Returns:
            tuple[Tensor]: results of decoder containing the following tensors.
                - inter_states: Outputs from the decoder. If return_intermediate_dec
                    is True, the shape is (num_dec_layers, bs, num_query, embed_dims),
                    otherwise (1, bs, num_query, embed_dims).
                - init_reference_out: The initial reference points,
                    with shape (bs, num_query, 3).
                - inter_references_out: The refined reference points of every
                    decoder layer, with shape (num_dec_layers, bs, num_query, 3).
                (enc_outputs_class / enc_outputs_coord_unact are only returned
                when `as_two_stage` is True, otherwise None.)
        """
        assert query_embed is not None
        bs = mlvl_feats[0].size(0)   # 1

        query_pos, query = torch.split(query_embed, self.embed_dims, dim=1)    # query: [900,256], query_pos: [900,256]
        query_pos = query_pos.unsqueeze(0).expand(bs, -1, -1)                  # [1,900,256]
        query = query.unsqueeze(0).expand(bs, -1, -1)                          # [1,900,256]
        reference_points = self.reference_points(query_pos)                    # [1,900,3]  Linear(in_features=256, out_features=3, bias=True)
        reference_points = reference_points.sigmoid()                          # squash the xyz coordinates into [0,1]
        init_reference_out = reference_points                                  # [1,900,3]

        # decoder
        query = query.permute(1, 0, 2)                  # [900,1,256]
        query_pos = query_pos.permute(1, 0, 2)          # [900,1,256]
        inter_states, inter_references = self.decoder(
            query=query,                        # [900,1,256]
            key=None,                           # None
            value=mlvl_feats,                   # [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
            query_pos=query_pos,                # [900,1,256]
            reference_points=reference_points,  # [1,900,3]
            reg_branches=reg_branches,          # 6 fully connected regression branches
            **kwargs)
        # inter_states: [6,900,1,256]    inter_references: [6,1,900,3]
        inter_references_out = inter_references     # [6,1,900,3]
        # init_reference_out holds the initial reference points; inter_references_out stacks
        # the refined reference points produced by each of the stacked decoder layers.
        return inter_states, init_reference_out, inter_references_out

2.2.2 feature_sampling: the 3D-to-2D projection and sampling module

2.2.2.1 Principle

        The feature_sampling function samples the multi-scale image features mlvl_feats at the 3D center points reference_points regressed from the queries, returning the feature values at those centers.

        (1) The "world" coordinate frame defined in the paper is the LiDAR (ego-vehicle) frame, so the points must first be transformed into the camera frames and then into pixel coordinates.

        Concretely: lidar2img is the coordinate transformation matrix (containing R and T). The xyz coordinates of reference_points are first rescaled with the limits stored in pc_range (they were previously normalized to 0-1 by the sigmoid), then converted to homogeneous coordinates, replicated six times (once per camera), and multiplied by the projection matrices. A series of masking, filtering, and scaling operations then maps the points onto the pixel coordinate frames.
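        In formulas, for a reference point c_i = (x, y, z) in the LiDAR frame and the lidar2img projection matrix T_m of camera m, the code below computes

$$c_i^{*} = (x,\ y,\ z,\ 1)^{\top}, \qquad (u',\ v',\ d,\ \cdot)^{\top} = T_m\, c_i^{*}, \qquad (u,\ v) = \Big(\frac{u'}{\max(d,\ \epsilon)},\ \frac{v'}{\max(d,\ \epsilon)}\Big)$$

        (u, v) is then divided by the image width and height and mapped to [-1, 1] for grid_sample; points with d ≤ ε (behind the camera) or falling outside [-1, 1] are masked out.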

        (2) Once the 2D center points reference_points_cam on the image planes are obtained (shape [1,6,900,2]), they are used to sample the multi-scale image features extracted by the backbone. Sampling is done with the bilinear interpolation function F.grid_sample: 900 points are sampled on each of the feature maps of sizes [6,256,116,200], [6,256,58,100], [6,256,29,50], [6,256,15,25], so every level yields a tensor of shape [6,256,900,1] with level-specific content. The four levels are finally stacked along a new last dimension, giving sampled_feats of shape [1,256,900,6,1,4].
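        As a quick illustration of why the code maps the pixel coordinates to [-1, 1]: F.grid_sample expects the sampling grid in normalized coordinates, where (-1, -1) is the top-left corner and (1, 1) the bottom-right of the input feature map. A made-up toy example (here align_corners=True is used for clean numbers; the repo calls grid_sample with its default align_corners=False):

import torch
import torch.nn.functional as F

feat = torch.arange(12, dtype=torch.float32).view(1, 1, 3, 4)   # [B=1, C=1, H=3, W=4]
# Two sampling locations: the exact center of the map and its top-left corner,
# given as (x, y) in normalized [-1, 1] coordinates.
grid = torch.tensor([[[[0.0, 0.0]], [[-1.0, -1.0]]]])           # [1, 2, 1, 2]
out = F.grid_sample(feat, grid, align_corners=True)             # [1, 1, 2, 1]
print(out.view(-1))   # tensor([5.5000, 0.0000]): interpolated center value, corner value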

2.2.2.2 Code
# The key feature-sampling function: project the 3D reference points into the 2D views and sample features there
def feature_sampling(mlvl_feats, reference_points, pc_range, img_metas):
    lidar2img = []
    for img_meta in img_metas:
        lidar2img.append(img_meta['lidar2img'])
    lidar2img = np.asarray(lidar2img)                           # [1,6,4,4] LiDAR-to-image projection matrices
    lidar2img = reference_points.new_tensor(lidar2img)          # (B, N, 4, 4) [1,6,4,4] convert the numpy array to a tensor
    reference_points = reference_points.clone()                 # [1,900,3]
    reference_points_3d = reference_points.clone()              # [1,900,3]

    # pc_range layout: [x_min, y_min, z_min, x_max, y_max, z_max]
    reference_points[..., 0:1] = reference_points[..., 0:1] * (pc_range[3] - pc_range[0]) + pc_range[0]   # rescale x
    reference_points[..., 1:2] = reference_points[..., 1:2] * (pc_range[4] - pc_range[1]) + pc_range[1]   # rescale y
    reference_points[..., 2:3] = reference_points[..., 2:3] * (pc_range[5] - pc_range[2]) + pc_range[2]   # rescale z

    # reference_points (B, num_queries, 4): convert to homogeneous coordinates
    reference_points = torch.cat((reference_points, torch.ones_like(reference_points[..., :1])), -1)
    B, num_query = reference_points.size()[:2]      # B: 1, num_query: 900
    num_cam = lidar2img.size(1)                     # 6
    reference_points = reference_points.view(B, 1, num_query, 4).repeat(1, num_cam, 1, 1).unsqueeze(-1)     # [1,6,900,4,1] replicate for the 6 cameras
    lidar2img = lidar2img.view(B, num_cam, 1, 4, 4).repeat(1, 1, num_query, 1, 1)                           # [1,6,900,4,4] replicate the 6 projection matrices
    reference_points_cam = torch.matmul(lidar2img, reference_points).squeeze(-1)                            # [1,6,900,4]   apply the projection matrices
    eps = 1e-5                                                                                              # small threshold

    mask = (reference_points_cam[..., 2:3] > eps)                                                           # keep only points in front of the camera [1,6,900,1]
    reference_points_cam = reference_points_cam[..., 0:2] / torch.maximum(
        reference_points_cam[..., 2:3], torch.ones_like(reference_points_cam[..., 2:3])*eps)                # [1,6,900,2]   perspective divide onto the 2D image plane
    reference_points_cam[..., 0] /= img_metas[0]['img_shape'][0][1]     # normalize x by the image width
    reference_points_cam[..., 1] /= img_metas[0]['img_shape'][0][0]     # normalize y by the image height
    reference_points_cam = (reference_points_cam - 0.5) * 2             # map from [0,1] to [-1,1] for grid_sample
    mask = (mask & (reference_points_cam[..., 0:1] > -1.0)
                 & (reference_points_cam[..., 0:1] < 1.0)
                 & (reference_points_cam[..., 1:2] > -1.0)
                 & (reference_points_cam[..., 1:2] < 1.0))
    mask = mask.view(B, num_cam, 1, num_query, 1, 1).permute(0, 2, 3, 1, 4, 5)                              # [1,1,900,6,1,1]
    mask = torch.nan_to_num(mask)
    sampled_feats = []
    for lvl, feat in enumerate(mlvl_feats):
        # iterate over the FPN's multi-scale features, e.g. feat: [1,6,256,116,200]
        B, N, C, H, W = feat.size()
        feat = feat.view(B*N, C, H, W)                                                   # [6,256,116,200]
        reference_points_cam_lvl = reference_points_cam.view(B*N, num_query, 1, 2)       # [6,900,1,2]
        sampled_feat = F.grid_sample(feat, reference_points_cam_lvl)                     # [6,256,900,1]
        sampled_feat = sampled_feat.view(B, N, C, num_query, 1).permute(0, 2, 3, 1, 4)   # [1,256,900,6,1]
        sampled_feats.append(sampled_feat)
    sampled_feats = torch.stack(sampled_feats, -1)                                       # [1,256,900,6,1,4]
    sampled_feats = sampled_feats.view(B, C, num_query, num_cam, 1, len(mlvl_feats))     # [1,256,900,6,1,4]
    return reference_points_3d, sampled_feats, mask

2.2.3 The Detr3DCrossAtten module

2.2.3.1 Principle

        The main job of the Detr3DCrossAtten module is to let the query interact with the image features (value) at the locations given by reference_points.

        First, query = query + query_pos, i.e. the query plus its positional encoding. The result is fed to a fully connected layer that produces attention_weights, whose shape goes [1,900,256] -> [1,900,24] -> [1,1,900,6,1,4] (24 = 6 cameras x 1 point x 4 feature levels).

        Next comes the feature_sampling step described in the previous subsection, which returns reference_points_3d and output, where output stores the features sampled from the multi-scale image features. The attention weights are applied to output, the last three dimensions (cameras, points, levels) are summed away, the axes are permuted, and a final fully connected layer projects the result to the final output of shape [900,1,256].

        Finally, self.position_encoder encodes the 3D point information reference_points_3d into a tensor of shape [900,1,256]. The value returned is self.dropout(output) + inp_residual + pos_feat, i.e. (the dropout-regularized sampled image features) + (the original query, as a residual) + (the encoding of the 3D point coordinates).
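        position_encoder itself is not shown in the snippet below; conceptually it is a small MLP lifting the 3-dimensional (inverse-sigmoid) reference point into the 256-dimensional embedding space, roughly along these lines (a sketch; the exact layer composition in the repo may differ):

import torch.nn as nn

embed_dims = 256
# Sketch of a reference-point position encoder: 3 -> embed_dims via Linear/LayerNorm/ReLU blocks.
position_encoder = nn.Sequential(
    nn.Linear(3, embed_dims),
    nn.LayerNorm(embed_dims),
    nn.ReLU(inplace=True),
    nn.Linear(embed_dims, embed_dims),
    nn.LayerNorm(embed_dims),
    nn.ReLU(inplace=True),
)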

2.2.3.2 Code
    def forward(self,
                query,                      # [900,1,256]
                key,                        # None
                value,                      # list: [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
                residual=None,              # None
                query_pos=None,             # [900,1,256]
                key_padding_mask=None,      # None
                reference_points=None,      # [1,900,3]
                spatial_shapes=None,        # None
                level_start_index=None,     # None
                **kwargs):
        """Forward Function of Detr3DCrossAtten.
        Args:
            query (Tensor): Query of Transformer with shape
                (num_query, bs, embed_dims).
            key (Tensor): The key tensor with shape `(num_key, bs, embed_dims)`.
            value (Tensor): The value tensor, here the multi-view multi-scale
                image features with shape (B, N, C, H, W).
            residual (Tensor): The tensor used for addition, with the
                same shape as `query`. Default None. If None, `query` will be used.
            query_pos (Tensor): The positional encoding for `query`. Default: None.
            reference_points (Tensor): The normalized reference points with shape
                (bs, num_query, 3), all elements in [0, 1].
            key_padding_mask (Tensor): ByteTensor for `query`, with shape [bs, num_key].
            spatial_shapes (Tensor): Spatial shapes of features at different levels,
                with shape (num_levels, 2); the last dimension represents (h, w).
            level_start_index (Tensor): The start index of each level, with shape
                (num_levels,), e.g. [0, h_0*w_0, h_0*w_0+h_1*w_1, ...].
        Returns:
            Tensor: forwarded results with shape [num_query, bs, embed_dims].
        """
        if key is None:
            key = query
        if value is None:
            value = key

        if residual is None:
            inp_residual = query        # [900,1,256] kept for the residual connection
        if query_pos is not None:
            query = query + query_pos   # [900,1,256]  query + its positional encoding

        # change to (bs, num_query, embed_dims)
        query = query.permute(1, 0, 2)  # [1,900,256]
        bs, num_query, _ = query.size() # bs: 1, num_query: 900, _: 256

        # [1,900,256] -> [1,900,24] -> [1,1,900,6,1,4]
        attention_weights = self.attention_weights(query).view(
            bs, 1, num_query, self.num_cams, self.num_points, self.num_levels)

        # reference_points_3d: [1,900,3]
        # output:              [1,256,900,6,1,4]
        # mask:                [1,1,900,6,1,1]
        reference_points_3d, output, mask = feature_sampling(
            value, reference_points, self.pc_range, kwargs['img_metas'])
        output = torch.nan_to_num(output)  # replace NaN / inf / -inf with finite numbers
        mask = torch.nan_to_num(mask)      # replace NaN / inf / -inf with finite numbers

        attention_weights = attention_weights.sigmoid() * mask     # [1,1,900,6,1,4]
        output = output * attention_weights                        # [1,256,900,6,1,4]
        output = output.sum(-1).sum(-1).sum(-1)                    # [1,256,900]  sum over cameras, points and levels
        output = output.permute(2, 0, 1)                           # [900,1,256]

        output = self.output_proj(output)                          # [900,1,256]

        # (num_query, bs, embed_dims): also encode the 3D reference points and add them
        # to the sampled image features
        pos_feat = self.position_encoder(inverse_sigmoid(reference_points_3d)).permute(1, 0, 2)  # [1,900,3] -> [1,900,256] -> [900,1,256]

        return self.dropout(output) + inp_residual + pos_feat

2.2.4 Other components

2.2.4.1 Principle

        The forward function of Detr3DHead.

        The transformer first produces hs, the stacked outputs of the 6 decoder layers, with shape [6,1,900,256]. For each decoder layer, two fully connected branches map the layer output hs[lvl] of shape [1,900,256] to [1,900,10], giving outputs_class and tmp, which predict the class and the box parameters respectively. The box centers in tmp are then combined with the reference points and shifted/rescaled using pc_range, i.e. the [x_min, y_min, z_min, x_max, y_max, z_max] limits of the 3D detection range. Finally, the predictions of all decoder layers are stacked for the loss computation.
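        Written out, the center decoding that the loop below performs (taking x as an example; y and the height coordinate stored at index 4 are handled the same way) is

$$c_x = \sigma\big(\Delta x + \sigma^{-1}(r_x)\big)\cdot(x_{\max} - x_{\min}) + x_{\min}$$

        where Δx is the raw output of the regression branch, r_x is the reference point's x coordinate, σ is the sigmoid, and [x_min, x_max] comes from pc_range.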

2.2.4.2 Code
    def forward(self, mlvl_feats, img_metas):
        """Forward function.
        Args:
            mlvl_feats (tuple[Tensor]): Features from the upstream
                network, each a 5D tensor with shape (B, N, C, H, W).
        Returns:
            all_cls_scores (Tensor): Outputs from the classification head,
                shape [nb_dec, bs, num_query, cls_out_channels]. Note that
                cls_out_channels should include the background.
            all_bbox_preds (Tensor): Sigmoid outputs from the regression
                head with normalized coordinate format
                (cx, cy, w, l, cz, h, theta, vx, vy).
                Shape [nb_dec, bs, num_query, 9].
        """
        query_embeds = self.query_embedding.weight      # [900,512]

        hs, init_reference, inter_references = self.transformer(
            mlvl_feats,                 # [1,6,256,116,200], [1,6,256,58,100], [1,6,256,29,50], [1,6,256,15,25]
            query_embeds,               # [900,512]
            reg_branches=self.reg_branches if self.with_box_refine else None,  # 6 fully connected regression branches
            img_metas=img_metas,)       # list of image meta dicts
        hs = hs.permute(0, 2, 1, 3)     # hs: [6,900,1,256] -> [6,1,900,256]   init_reference: [1,900,3]   inter_references: [6,1,900,3]
        outputs_classes = []
        outputs_coords = []

        for lvl in range(hs.shape[0]):      # iterate over the outputs of every decoder layer
            if lvl == 0:
                reference = init_reference
            else:
                reference = inter_references[lvl - 1]
            reference = inverse_sigmoid(reference)          # inverse sigmoid
            outputs_class = self.cls_branches[lvl](hs[lvl]) # map each decoder output [1,900,256] to class logits [1,900,10]
            tmp = self.reg_branches[lvl](hs[lvl])           # [1,900,10]

            # TODO: check the shape of reference
            assert reference.shape[-1] == 3
            tmp[..., 0:2] += reference[..., 0:2]
            tmp[..., 0:2] = tmp[..., 0:2].sigmoid()
            tmp[..., 4:5] += reference[..., 2:3]
            tmp[..., 4:5] = tmp[..., 4:5].sigmoid()

            tmp[..., 0:1] = (tmp[..., 0:1] * (self.pc_range[3] - self.pc_range[0]) + self.pc_range[0])
            tmp[..., 1:2] = (tmp[..., 1:2] * (self.pc_range[4] - self.pc_range[1]) + self.pc_range[1])
            tmp[..., 4:5] = (tmp[..., 4:5] * (self.pc_range[5] - self.pc_range[2]) + self.pc_range[2])

            # TODO: check if using sigmoid
            outputs_coord = tmp                      # box regression result
            outputs_classes.append(outputs_class)    # classification result
            outputs_coords.append(outputs_coord)     # box regression result

        outputs_classes = torch.stack(outputs_classes)
        outputs_coords = torch.stack(outputs_coords)
        outs = {
            'all_cls_scores': outputs_classes,
            'all_bbox_preds': outputs_coords,
            'enc_cls_scores': None,
            'enc_bbox_preds': None,
        }
        return outs

2.3 Loss

2.3.1 Principle

        The loss here is a fairly standard detection loss, consisting of a classification loss and a regression loss. The only part worth special attention is the _get_target_single function, whose main job is to build ("pad") per-query targets from the matched ground truth (it is worth going through the code carefully).
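        As a toy illustration of this padding (a sketch with made-up indices, not the repo code): every one of the 900 queries receives a target, but only the queries matched to a ground-truth box by the assigner get a real class label and a non-zero box weight; all others default to the background class:

import torch

num_query, num_classes = 900, 10
pos_inds = torch.tensor([3, 17, 42])             # hypothetical indices of matched (positive) queries
pos_assigned_gt_inds = torch.tensor([2, 0, 1])   # which GT box each positive query was matched to
gt_labels = torch.tensor([5, 0, 8])              # hypothetical GT class ids

labels = torch.full((num_query,), num_classes, dtype=torch.long)  # background class by default
labels[pos_inds] = gt_labels[pos_assigned_gt_inds]                # overwrite positives with real classes

bbox_weights = torch.zeros(num_query, 10)
bbox_weights[pos_inds] = 1.0   # only the positive queries contribute to the regression loss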

2.3.2 Code

        The loss function:

    def loss(self,
             gt_bboxes_list,            # box info for the 18 GT objects
             gt_labels_list,            # labels for the 18 GT objects
             preds_dicts,               # [[6,1,900,10], [6,1,900,10], None, None]
             gt_bboxes_ignore=None):    # None
        """Loss function.
        Args:
            gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image
                with shape (num_gts, 4) in [tl_x, tl_y, br_x, br_y] format.
            gt_labels_list (list[Tensor]): Ground truth class indices for each
                image with shape (num_gts, ).
            preds_dicts:
                all_cls_scores (Tensor): Classification scores of all decoder
                    layers, shape [nb_dec, bs, num_query, cls_out_channels].
                all_bbox_preds (Tensor): Sigmoid regression outputs of all decoder
                    layers, each a 4D tensor with normalized coordinate format,
                    shape [nb_dec, bs, num_query, 4].
                enc_cls_scores / enc_bbox_preds: only used when as_two_stage is
                    True, otherwise None.
            gt_bboxes_ignore (list[Tensor], optional): Bounding boxes
                which can be ignored for each image. Default None.
        Returns:
            dict[str, Tensor]: A dictionary of loss components.
        """
        assert gt_bboxes_ignore is None, \
            f'{self.__class__.__name__} only supports ' \
            f'for gt_bboxes_ignore setting to None.'

        all_cls_scores = preds_dicts['all_cls_scores']  # [6,1,900,10]
        all_bbox_preds = preds_dicts['all_bbox_preds']  # [6,1,900,10]
        enc_cls_scores = preds_dicts['enc_cls_scores']  # None
        enc_bbox_preds = preds_dicts['enc_bbox_preds']  # None

        num_dec_layers = len(all_cls_scores)    # number of decoder layers: 6
        device = gt_labels_list[0].device

        # gt_bboxes.gravity_center: an attribute of the gt_bboxes object holding the gravity center
        #   of each box, a tensor of shape (N, 3) where N is the number of boxes and 3 the (x, y, z)
        #   coordinates.
        # gt_bboxes.tensor[:, 3:]: selects, for all boxes, the columns from the 4th on, i.e. the box
        #   dimensions and remaining attributes (e.g. yaw); if gt_bboxes.tensor has shape (N, M) with
        #   M >= 4, this returns a tensor of shape (N, M-3).
        gt_bboxes_list = [torch.cat(
            (gt_bboxes.gravity_center, gt_bboxes.tensor[:, 3:]), dim=1).to(device)
            for gt_bboxes in gt_bboxes_list]

        all_gt_bboxes_list = [gt_bboxes_list for _ in range(num_dec_layers)]            # replicate once per decoder layer
        all_gt_labels_list = [gt_labels_list for _ in range(num_dec_layers)]            # replicate once per decoder layer
        all_gt_bboxes_ignore_list = [gt_bboxes_ignore for _ in range(num_dec_layers)]   # replicate once per decoder layer

        losses_cls, losses_bbox = multi_apply(self.loss_single,
                                              all_cls_scores, all_bbox_preds,
                                              all_gt_bboxes_list,
                                              all_gt_labels_list,
                                              all_gt_bboxes_ignore_list)

        loss_dict = dict()
        # loss of proposals generated from the encoder feature map
        if enc_cls_scores is not None:
            binary_labels_list = [torch.zeros_like(gt_labels_list[i]) for i in range(len(all_gt_labels_list))]
            enc_loss_cls, enc_losses_bbox = self.loss_single(
                enc_cls_scores, enc_bbox_preds, gt_bboxes_list, binary_labels_list, gt_bboxes_ignore)
            loss_dict['enc_loss_cls'] = enc_loss_cls
            loss_dict['enc_loss_bbox'] = enc_losses_bbox

        # loss from the last decoder layer
        loss_dict['loss_cls'] = losses_cls[-1]
        loss_dict['loss_bbox'] = losses_bbox[-1]

        # losses from the other decoder layers
        num_dec_layer = 0
        for loss_cls_i, loss_bbox_i in zip(losses_cls[:-1], losses_bbox[:-1]):
            loss_dict[f'd{num_dec_layer}.loss_cls'] = loss_cls_i
            loss_dict[f'd{num_dec_layer}.loss_bbox'] = loss_bbox_i
            num_dec_layer += 1
        return loss_dict

        The _get_target_single function:

    def _get_target_single(self,
                           cls_score,              # [900,10]
                           bbox_pred,              # [900,10]
                           gt_labels,              # [18]
                           gt_bboxes,              # [18,9]
                           gt_bboxes_ignore=None): # None
        """Compute regression and classification targets for one image.
        Outputs from a single decoder layer of a single feature level are used.
        Args:
            cls_score (Tensor): Box score logits from a single decoder layer
                for one image. Shape [num_query, cls_out_channels].
            bbox_pred (Tensor): Sigmoid outputs from a single decoder layer
                for one image, with normalized coordinates and
                shape [num_query, 4].
            gt_bboxes (Tensor): Ground truth bboxes for one image with
                shape (num_gts, 4).
            gt_labels (Tensor): Ground truth class indices for one image
                with shape (num_gts, ).
            gt_bboxes_ignore (Tensor, optional): Bounding boxes
                which can be ignored. Default None.
        Returns:
            tuple[Tensor]: a tuple containing the following for one image.
                - labels (Tensor): Labels of each image.
                - label_weights (Tensor): Label weights of each image.
                - bbox_targets (Tensor): BBox targets of each image.
                - bbox_weights (Tensor): BBox weights of each image.
                - pos_inds (Tensor): Sampled positive indices for each image.
                - neg_inds (Tensor): Sampled negative indices for each image.
        """
        num_bboxes = bbox_pred.size(0)      # 900

        # assigner and sampler: positive/negative sample assignment and sampling.
        # The assigner matches the predicted boxes against the ground-truth boxes and decides which
        # predictions are positives and which are negatives; the sampler then collects the matched
        # indices so that positives and negatives are handled consistently.
        assign_result = self.assigner.assign(bbox_pred, cls_score, gt_bboxes, gt_labels, gt_bboxes_ignore)
        sampling_result = self.sampler.sample(assign_result, bbox_pred, gt_bboxes)

        pos_inds = sampling_result.pos_inds     # [18]   indices of the positive queries
        neg_inds = sampling_result.neg_inds     # [882]  indices of the negative queries

        # label targets
        # Initialize a [900] tensor filled with num_classes (used as the background index),
        # then overwrite the positive indices with the matched ground-truth class labels.
        labels = gt_bboxes.new_full((num_bboxes, ), self.num_classes, dtype=torch.long)     # [900]
        labels[pos_inds] = gt_labels[sampling_result.pos_assigned_gt_inds]
        # Label weights are set to 1 for all queries (positives and negatives); in some setups
        # one might want to weight the negatives differently.
        label_weights = gt_bboxes.new_ones(num_bboxes)      # [900]

        # bbox targets
        # Initialize a tensor shaped like bbox_pred but keeping only the first 9 channels
        # (the box has 9 regression parameters), plus an all-zero weight tensor whose
        # positive entries are then set to 1.0.
        bbox_targets = torch.zeros_like(bbox_pred)[..., :9]   # [900,9]
        bbox_weights = torch.zeros_like(bbox_pred)            # [900,10]
        bbox_weights[pos_inds] = 1.0

        # DETR-style: set the bbox targets of the positive queries to the matched ground-truth boxes.
        bbox_targets[pos_inds] = sampling_result.pos_gt_bboxes

        # Note: the returned tensors mix real targets (for the positives) and initialized
        # placeholder values (for the negatives).
        return (labels, label_weights, bbox_targets, bbox_weights, pos_inds, neg_inds)

        The get_targets function:

    def get_targets(self,
                    cls_scores_list,        # [900,10]
                    bbox_preds_list,        # [900,10]
                    gt_bboxes_list,         # [18,9]: 9 values per box (xyz center, 3 dimensions, yaw, 2D velocity)
                    gt_labels_list,         # 18 labels
                    gt_bboxes_ignore_list=None):
        """Compute regression and classification targets for a batch of images.
        Outputs from a single decoder layer of a single feature level are used.
        Args:
            cls_scores_list (list[Tensor]): Box score logits from a single
                decoder layer for each image with shape [num_query, cls_out_channels].
            bbox_preds_list (list[Tensor]): Sigmoid outputs from a single
                decoder layer for each image, with normalized coordinates and
                shape [num_query, 4].
            gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image.
            gt_labels_list (list[Tensor]): Ground truth class indices for each
                image with shape (num_gts, ).
            gt_bboxes_ignore_list (list[Tensor], optional): Bounding
                boxes which can be ignored for each image. Default None.
        Returns:
            tuple: a tuple containing the following targets.
                - labels_list (list[Tensor]): Labels for all images.
                - label_weights_list (list[Tensor]): Label weights for all images.
                - bbox_targets_list (list[Tensor]): BBox targets for all images.
                - bbox_weights_list (list[Tensor]): BBox weights for all images.
                - num_total_pos (int): Number of positive samples in all images.
                - num_total_neg (int): Number of negative samples in all images.
        """
        assert gt_bboxes_ignore_list is None, \
            'Only supports for gt_bboxes_ignore setting to None.'
        num_imgs = len(cls_scores_list)     # 1
        gt_bboxes_ignore_list = [gt_bboxes_ignore_list for _ in range(num_imgs)]       # None

        (labels_list, label_weights_list, bbox_targets_list,
         bbox_weights_list, pos_inds_list, neg_inds_list) = multi_apply(
             self._get_target_single, cls_scores_list, bbox_preds_list,
             gt_labels_list, gt_bboxes_list, gt_bboxes_ignore_list)
        num_total_pos = sum((inds.numel() for inds in pos_inds_list))       # 18
        num_total_neg = sum((inds.numel() for inds in neg_inds_list))       # 882
        return (labels_list, label_weights_list, bbox_targets_list,
                bbox_weights_list, num_total_pos, num_total_neg)


        The loss_single function:

    def loss_single(self,
                    cls_scores,                 # [1,900,10]
                    bbox_preds,                 # [1,900,10]
                    gt_bboxes_list,             # list[[18,9]]: xyz center, 3 dimensions, yaw, 2D velocity
                    gt_labels_list,             # list[18 labels]
                    gt_bboxes_ignore_list=None):# None
        """Loss function for outputs from a single decoder layer of a single
        feature level.
        Args:
            cls_scores (Tensor): Box score logits from a single decoder layer
                for all images. Shape [bs, num_query, cls_out_channels].
            bbox_preds (Tensor): Sigmoid outputs from a single decoder layer
                for all images, with normalized coordinates and
                shape [bs, num_query, 4].
            gt_bboxes_list (list[Tensor]): Ground truth bboxes for each image.
            gt_labels_list (list[Tensor]): Ground truth class indices for each
                image with shape (num_gts, ).
            gt_bboxes_ignore_list (list[Tensor], optional): Bounding
                boxes which can be ignored for each image. Default None.
        Returns:
            dict[str, Tensor]: A dictionary of loss components for outputs from
                a single decoder layer.
        """
        num_imgs = cls_scores.size(0)   # 1
        cls_scores_list = [cls_scores[i] for i in range(num_imgs)]  # [900,10]
        bbox_preds_list = [bbox_preds[i] for i in range(num_imgs)]  # [900,10]
        cls_reg_targets = self.get_targets(cls_scores_list, bbox_preds_list,
                                           gt_bboxes_list, gt_labels_list, gt_bboxes_ignore_list)
        # cls_reg_targets is a 6-tuple holding the matched (ground-truth-derived) targets
        (labels_list, label_weights_list, bbox_targets_list, bbox_weights_list,
         num_total_pos, num_total_neg) = cls_reg_targets
        # target labels
        labels = torch.cat(labels_list, 0)                  # [900]
        # target label weights
        label_weights = torch.cat(label_weights_list, 0)    # [900]
        # target boxes
        bbox_targets = torch.cat(bbox_targets_list, 0)      # [900,9]
        # target box weights
        bbox_weights = torch.cat(bbox_weights_list, 0)      # [900,10]

        # predicted classification scores
        cls_scores = cls_scores.reshape(-1, self.cls_out_channels)  # [900,10]

        # construct weighted avg_factor to match with the official DETR repo
        cls_avg_factor = num_total_pos * 1.0 + num_total_neg * self.bg_cls_weight   # 18
        if self.sync_cls_avg_factor:
            cls_avg_factor = reduce_mean(cls_scores.new_tensor([cls_avg_factor]))
        # classification loss
        cls_avg_factor = max(cls_avg_factor, 1)
        loss_cls = self.loss_cls(cls_scores, labels, label_weights, avg_factor=cls_avg_factor)  # classification loss

        # Compute the average number of gt boxes across all gpus, for
        # normalization purposes
        num_total_pos = loss_cls.new_tensor([num_total_pos])
        num_total_pos = torch.clamp(reduce_mean(num_total_pos), min=1).item()

        # regression L1 loss
        bbox_preds = bbox_preds.reshape(-1, bbox_preds.size(-1))                # [900,10]
        normalized_bbox_targets = normalize_bbox(bbox_targets, self.pc_range)   # [900,10]
        isnotnan = torch.isfinite(normalized_bbox_targets).all(dim=-1)          # boolean mask
        bbox_weights = bbox_weights * self.code_weights                         # [900,10]

        loss_bbox = self.loss_bbox(bbox_preds[isnotnan, :10],
                                   normalized_bbox_targets[isnotnan, :10],
                                   bbox_weights[isnotnan, :10],
                                   avg_factor=num_total_pos)

        loss_cls = torch.nan_to_num(loss_cls)
        loss_bbox = torch.nan_to_num(loss_bbox)
        return loss_cls, loss_bbox

Summary

        (1) DETR3D is an extension of 2D DETR to 3D: it initializes 3D object queries, projects their reference points onto the 2D image planes to interact with the multi-view image features, and predicts the 3D object locations from the refined queries.

        (2) It is completely different from LSS, BEVDet, and other BEV schemes based on explicit depth estimation.

        (3) BEVFormer can be seen as an improved version of DETR3D, a combination of DETR3D and the BEV paradigm; personally I regard it as sitting between DETR3D and the depth-based BEV approaches.

References

DETR3D: Applying DETR to the 3D object detection task - CSDN blog

Paper walkthrough: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries" - CSDN blog
