0 Overview
DETR uses a Transformer-based encoder-decoder architecture that treats object detection as a direct set prediction problem, which simplifies the training pipeline. Given a fixed set of learned object queries, DETR reasons about the relations between the objects and the global image context and outputs the final set of predictions directly in parallel. The model is conceptually simple and does not require specialized libraries. In addition, DETR can be easily generalized to produce panoptic segmentation in a unified manner. This document introduces the DETR object detection algorithm and explains how to use it.
1 Performance and accuracy indicators
| dataset | input_shape | backbone | FPS (dual-core) | floating-point mAP | quantized mAP |
|---|---|---|---|---|---|
| mscoco | [1, 3, 800, 1332] | efficientnet-b3 | 62.29 | 37.21 | 35.97 |
| mscoco | [1, 3, 800, 1332] | resnet50 | 47.32 | 35.69 | 31.34 |
2 Model optimization point description
- resnet50: Compared with the official version, the dynamic input shape is changed to a fixed shape during inference.
- efficientnet-b3: The training strategy is optimized, and the data augmentation differs from the public version.
- transformer: Optimized for the BPU, including a dimension adjustment (3-dimensional → 4-dimensional tensors) and a changed op execution sequence (to fit BPU characteristics and improve performance). The logic is consistent with the public version.
3 Model introduction
Model structure
DETR consists of the following four parts:
- CNN backbone: learns a 2D representation of the input image. Two options are provided: resnet50 and efficientnet-b3.
- Transformer encoder: learns the global information of the image.
- Transformer decoder: decodes the output of the encoder.
- Head: outputs the prediction results through an FFN.
backbone
The backbone is a CNN that extracts 2D image features. The DETR model provides two backbone options, resnet50 and efficientnet-b3, both with a downsampling factor of 32.
- Input shape: [1, 3, 800, 1332]
- Output shape: efficientnet-b3: [1, 384, 25, 42]; resnet50: [1, 2048, 25, 42]
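The spatial size of the backbone output follows from the downsampling factor. The sketch below reproduces the 25 x 42 feature map from the 800 x 1332 input; it assumes sizes that are not divisible by 32 are rounded up, which is consistent with the shapes listed above but is not confirmed by the backbone code shown here. The helper name feature_map_hw is illustrative only.

```python
import math


def feature_map_hw(height, width, stride=32):
    """Spatial size after downsampling by `stride`, rounding up (assumption)."""
    return math.ceil(height / stride), math.ceil(width / stride)


# 800 / 32 = 25 exactly, 1332 / 32 = 41.625 which rounds up to 42.
print(feature_map_hw(800, 1332))  # (25, 42)
```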
resnet50
Code: hat/models/backbones/resnet.py. The implementation is consistent with the public version.
efficientnet-b3
Code: hat/models/backbones/efficientnet.py. EfficientNet is a high-efficiency network structure supported by Horizon. The b suffix denotes the EfficientNet variant; each variant corresponds to a different resolution and network depth. For b3, [width_coefficient, depth_coefficient, default_resolution, dropout_rate] is set to (1.2, 1.4, 300, 0.3).
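To illustrate how the width and depth coefficients are typically applied, here is a minimal sketch of the rounding rules used by the public EfficientNet reference implementation. It is not taken from hat/models/backbones/efficientnet.py, so the helper names and the divisor of 8 are assumptions. The b0 baseline values (320 channels and 1 repeat in the last stage) are also assumed from the public model definition.

```python
import math


def round_filters(filters, width_coefficient, divisor=8):
    """Scale a channel count and round it to a multiple of `divisor`."""
    scaled = filters * width_coefficient
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    # Do not let rounding shrink the channel count by more than 10%.
    if rounded < 0.9 * scaled:
        rounded += divisor
    return int(rounded)


def round_repeats(repeats, depth_coefficient):
    """Scale the number of block repeats in a stage, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))


# efficientnet-b3 coefficients from the text above.
WIDTH, DEPTH = 1.2, 1.4
# Assumed b0 baseline for the last stage: 320 channels, 1 repeat.
print(round_filters(320, WIDTH))  # 384
print(round_repeats(1, DEPTH))    # 2
```

Under these assumptions the last stage has 384 channels, which agrees with the efficientnet-b3 output shape [1, 384, 25, 42] given above.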
transformer
- Use a 1x1 conv to reduce the number of feature channels from the backbone dimension to 256:
    self.input_proj = nn.Conv2d(
        self.in_channels, self.embed_dims, kernel_size=1
    )
    x = self.input_proj(x)
- Create a padding mask at the input image resolution and interpolate it to the feature-map resolution. The dimension changes as [B, H, W] → [B, feat_h, feat_w]:
    masks = torch.zeros(
        (batch_size, input_img_h, input_img_w), device=x.device
    )
    # interpolate masks to have the same spatial shape as x
    masks = (
        F.interpolate(masks.unsqueeze(1), size=x.shape[-2:])
        .to(torch.bool)
        .squeeze(1)
    )
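The mask step can be illustrated without PyTorch. The sketch below is a pure-Python approximation, assuming the convention that True marks padded pixels and that the resize behaves like nearest-neighbour sampling (the default mode of F.interpolate). The function names are hypothetical and not part of the code base.

```python
def make_padding_mask(padded_h, padded_w, valid_h, valid_w):
    """True marks padded pixels, False marks valid image content."""
    return [
        [not (y < valid_h and x < valid_w) for x in range(padded_w)]
        for y in range(padded_h)
    ]


def downsample_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize: output (y, x) reads input
    (y * in_h // out_h, x * in_w // out_w)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Toy case: an 8x8 canvas whose right half is padding, reduced to 2x2.
mask = make_padding_mask(8, 8, valid_h=8, valid_w=4)
print(downsample_nearest(mask, 2, 2))  # [[False, True], [False, True]]
```

With the fixed input shape used in this sample the mask starts as all zeros, so after interpolation every position is False and no key is masked out in attention.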
- Position encoding. Code: hat/models/embeddings.py
    class PositionEmbeddingSine(nn.Module):
        ...
        def forward(self, mask):
            not_mask = ~mask
            y_embed = not_mask.cumsum(1, dtype=torch.float32)
            x_embed = not_mask.cumsum(2, dtype=torch.float32)
            if self.normalize:
                eps = 1e-6
                y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
                x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

            dim_t = torch.arange(
                self.num_pos_feats, dtype=torch.float32, device=mask.device
            )
            dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

            pos_x = x_embed[:, :, :, None] / dim_t
            pos_y = y_embed[:, :, :, None] / dim_t
            pos_x = torch.stack(
                (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4
            ).flatten(3)
            pos_y = torch.stack(
                (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4
            ).flatten(3)
            pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
            return pos
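To make the channel layout of the sine embedding concrete, the following pure-Python sketch computes the embedding for a single (y, x) position. It mirrors the arithmetic of the forward pass above, but the default values num_pos_feats=128 and temperature=10000 are assumptions (common DETR defaults), not values read from the config.

```python
import math


def sine_embedding(position, num_pos_feats=128, temperature=10000):
    """Interleaved sin/cos embedding of one scalar position."""
    emb = []
    for i in range(num_pos_feats):
        # Channels 2k and 2k+1 share the same frequency.
        dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
        angle = position / dim_t
        # Even channels use sin, odd channels use cos.
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def position_embedding(y, x, num_pos_feats=128, temperature=10000):
    """Concatenate the y and x embeddings, y first."""
    return (sine_embedding(y, num_pos_feats, temperature)
            + sine_embedding(x, num_pos_feats, temperature))


pos = position_embedding(y=1.0, x=2.0)
print(len(pos))  # 256
```

With 128 features per axis, the concatenated embedding has 256 channels, matching the 256-channel feature map produced by the 1x1 conv.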
- Transformer encoder-decoder. Code: hat/models/task_modules/detr/transformer.py
    class Transformer(nn.Module):
        ...
            bs, c, h, w = x.shape
            query_embed = (
                query_embed.transpose(0, 1)
                .unsqueeze(0)
                .repeat(bs, 1, 1)
                .contiguous()
                .view(bs, query_embed.shape[1], 2, query_embed.shape[0] // 2)
            )  # [num_query, dim] → [bs, dim, 2, num_query/2]
            mask = mask.flatten(1)  # [bs, h, w] → [bs, h*w]

            tgt = torch.zeros_like(query_embed)  # [bs, dim, 2, num_query/2]
            # tgt = torch.zeros(query_embed.size(), device=query_embed.device)
            tgt = self.tgt_quant(tgt)
            memory = self.encoder(x, src_key_padding_mask=mask, pos=pos_embed)
            hs = self.decoder(
                tgt,
                memory,
                memory_key_padding_mask=mask,
                pos=pos_embed,
                query_pos=query_embed,
            )  # [nb_dec, bs, dim, 2, num_query/2]
            hs = (
                hs.contiguous()
                .view(
                    hs.shape[0],
                    hs.shape[1],
                    hs.shape[2],
                    hs.shape[3] * hs.shape[4],
                )
                .permute(0, 1, 3, 2)
            )  # [nb_dec, bs, num_query, embed_dim]
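The shape bookkeeping in the code above can be checked with a few lines of plain Python. This is only a sketch of the shape arithmetic; the helper names are hypothetical, and num_query=100 is an assumed value (the usual DETR default), while dim=256 and six decoder layers come from the text.

```python
def query_layout(num_query, dim, bs):
    """Shape of query_embed after transpose / repeat / view."""
    assert num_query % 2 == 0, "num_query must be even to split into 2 rows"
    return (bs, dim, 2, num_query // 2)


def decoder_output_layout(nb_dec, bs, dim, num_query):
    """Shape of hs after merging the last two dims and permute(0, 1, 3, 2)."""
    four_d = (nb_dec, bs, dim, 2, num_query // 2)
    merged = (four_d[0], four_d[1], four_d[2], four_d[3] * four_d[4])
    return (merged[0], merged[1], merged[3], merged[2])


print(query_layout(num_query=100, dim=256, bs=1))                     # (1, 256, 2, 50)
print(decoder_output_layout(nb_dec=6, bs=1, dim=256, num_query=100))  # (6, 1, 100, 256)
```

The queries are stored as a small 4-dimensional feature map so that the decoder can reuse the same 4-dimensional attention layer as the encoder.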
TransformerEncoderLayer: MultiheadAttention + FFN, 6 layers
    class TransformerEncoderLayer(nn.Module):
        ...
        def forward_post(
            self,
            src,
            src_mask: Optional[Tensor] = None,
            src_key_padding_mask: Optional[Tensor] = None,
            pos: Optional[Tensor] = None,
        ):
            # add the position embedding to Q and K, output shape: [bs, c, h, w]
            q = k = self.with_pos_embed(src, pos)
            # MultiheadAttention
            src2 = self.self_attn(
                q,
                k,
                value=src,
                attn_mask=src_mask,
                key_padding_mask=src_key_padding_mask,
            )[0]  # [bs, embed_dim, h, w]
            # shortcut and norm
            src = self.dropout1_add.add(src, self.dropout1(src2))
            src = self.norm1(src)
            # FFN
            src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
            src = self.dropout2_add.add(src, self.dropout2(src2))
            src = self.norm2(src)  # [bs, c, h, w]
            return src
TransformerDecoderLayer: MultiheadAttention + MultiheadAttention + FFN, 6 layers
    class TransformerDecoderLayer(nn.Module):
        ...
            # in __init__: self-attention and encoder-decoder attention
            self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
            self.multihead_attn = MultiheadAttention(
                d_model, nhead, dropout=dropout
            )
        ...
        def forward_post(
            self,
            tgt,
            memory,
            tgt_mask: Optional[Tensor] = None,
            memory_mask: Optional[Tensor] = None,
            tgt_key_padding_mask: Optional[Tensor] = None,
            memory_key_padding_mask: Optional[Tensor] = None,
            pos: Optional[Tensor] = None,
            query_pos: Optional[Tensor] = None,
        ):
            ...
            q = k = self.with_pos_embed(
                tgt, query_pos
            )  # [bs, dim, 2, num_query/2]
            # MultiheadAttention (self-attention over the queries)
            tgt2 = self.self_attn(
                q,
                k,
                value=tgt,
                attn_mask=tgt_mask,
                key_padding_mask=tgt_key_padding_mask,
            )[0]  # [bs, dim, 2, num_query/2]
            tgt = self.dropout1_add.add(tgt, self.dropout1(tgt2))
            tgt = self.norm1(tgt)
            # MultiheadAttention (queries attend to the encoder memory)
            tgt2 = self.multihead_attn(
                query=self.with_pos_embed(tgt, query_pos),
                key=self.with_pos_embed(memory, pos),
                value=memory,
                attn_mask=memory_mask,
                key_padding_mask=memory_key_padding_mask,
            )[0]  # [bs, dim, 2, num_query/2]
            # shortcut and norm
            tgt = self.dropout2_add.add(tgt, self.dropout2(tgt2))
            tgt = self.norm2(tgt)
            # FFN
            tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
            tgt = self.dropout3_add.add(tgt, self.dropout3(tgt2))
            tgt = self.norm3(tgt)
            return tgt  # [bs, dim, 2, num_query/2]
MultiheadAttention: attention layer operating on 4-dimensional tensors
    class MultiheadAttention(nn.Module):
        ...
        def forward(
            self,
            query: Tensor,
            key: Tensor,
            value: Tensor,
            key_padding_mask: Optional[Tensor] = None,
            attn_mask: Optional[Tensor] = None,
        ):
            # set up shape vars
            bsz, embed_dim, tgt_h, tgt_w = query.shape
            _, _, src_h, src_w = key.shape
            ...
            # q k v projection layers
            q = self.q_proj(query)
            k = self.k_proj(key)
            v = self.v_proj(value)
            q = (
                q.contiguous()
                .view(bsz * self.num_heads, self.head_dim, tgt_h, tgt_w)
                .permute(0, 2, 3, 1)
            )
            k = k.contiguous().view(
                bsz * self.num_heads, 1, self.head_dim, src_h * src_w
            )
            v = (
                v.contiguous()
                .view(bsz * self.num_heads, 1, self.head_dim, src_h * src_w)
                .permute(0, 1, 3, 2)
            )
            # update source sequence length after adjustments
            src_len = k.size(3)
            # merge key padding and attention masks
            if key_padding_mask is not None:
                ...
                key_padding_mask = (
                    key_padding_mask.view(bsz, 1, 1, src_len)
                    .expand(-1, self.num_heads, -1, -1)
                    .reshape(bsz * self.num_heads, 1, src_len)
                )
                if attn_mask is None:
                    attn_mask = key_padding_mask
                elif attn_mask.dtype == torch.bool:
                    attn_mask = attn_mask.logical_or(key_padding_mask)
                else:
                    attn_mask = attn_mask.masked_fill(
                        key_padding_mask, float("-100")
                    )
            # convert a boolean mask to an additive float mask
            if attn_mask is not None and attn_mask.dtype == torch.bool:
                new_attn_mask = torch.zeros_like(attn_mask, dtype=torch.float)
                new_attn_mask.masked_fill_(attn_mask, float("-100"))
                attn_mask = new_attn_mask
            if attn_mask is not None:
                attn_mask = self.attn_mask_quant(attn_mask)
            # q = q * self.scale
            # attention = (q @ k.transpose(-2, -1))
            scale = self.scale_quant(self.scale)
            q = self.mul.mul(
                q, scale
            )  # [bsz*self.num_heads, tgt_h, tgt_w, self.head_dim]
            attention = self.matmul.matmul(
                q, k
            )  # [bsz*self.num_heads, tgt_h, tgt_w, src_h*src_w]
            if attn_mask is not None:
                # attention = attention + mask
                attn_mask = attn_mask.contiguous().unsqueeze(1)
                attention = self.mask_add.add(attention, attn_mask)
                attention = self.softmax(attention)
            else:
                attention = self.softmax(attention)
            attention = self.attention_drop(attention)
            # output = (attention @ v)
            attn_output = self.attn_matmul.matmul(
                attention, v
            )  # [bsz*self.num_heads, tgt_h, tgt_w, self.head_dim]
            attn_output = (
                attn_output.permute(0, 3, 1, 2)
                .contiguous()
                .view(bsz, embed_dim, tgt_h, tgt_w)
            )
            attn_output = self.out_proj(attn_output)
            return attn_output, None
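The core computation of the attention layer (scale, dot product, additive mask, softmax, weighted sum) can be sketched for a single query in plain Python. This is an illustration rather than the Horizon implementation: the function names are hypothetical, and the scale 1/sqrt(head_dim) is an assumption about self.scale. The mask value of -100 follows the code above; a finite value is presumably used because it remains representable after quantization, whereas -inf would not.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, key_padding_mask=None, mask_value=-100.0):
    """Single-query scaled dot-product attention with an additive mask."""
    scale = 1.0 / math.sqrt(len(q))  # assumed definition of self.scale
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    if key_padding_mask is not None:
        # attention = attention + mask: masked keys get a large negative score
        scores = [
            s + (mask_value if masked else 0.0)
            for s, masked in zip(scores, key_padding_mask)
        ]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, weights = attend(q, keys, values, key_padding_mask=[False, False, True])
print(weights[2] < 1e-40)  # True: the padded key is effectively ignored
```

In this toy case the two unmasked keys share the weight equally, so the output is very close to the average of the first two value vectors.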
head
The head has two branches, one for classification and one for box regression. The classification branch is a single linear layer; the regression branch is MLP + ReLU + linear. The FFN predicts the center coordinates, height and width of each box, normalized with respect to the input image.
    self.fc_cls = nn.Linear(self.embed_dims, self.cls_out_channels)
    self.fc_reg = nn.Linear(self.embed_dims, 4)
    self.activate = nn.ReLU(inplace=True)
    self.reg_ffn = MlpModule2d(
        self.embed_dims,
        self.embed_dims,
        self.embed_dims,
        act_layer=self.act_layer,
        drop_ratio=0.0,
    )
    # in forward: outs_dec is the decoder output
    outputs_class = self.fc_cls(outs_dec)
    outputs_coord = self.fc_reg(self.activate(self.reg_ffn(outs_dec)))
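Because the regression output is normalized, a post-processing step is needed to obtain pixel coordinates. The sketch below shows one common conversion; it assumes the four values are ordered (cx, cy, w, h) and lie in [0, 1], which matches the public DETR convention but is not confirmed by the code shown here. The function name is hypothetical.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2


# A box centred in a 1332 x 800 image that covers half of each side.
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), img_w=1332, img_h=800))
# (333.0, 200.0, 999.0, 600.0)
```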
3 Floating-point training
3.1 Before you start
3.1.1 Environment Deployment
The DETR sample is located in the OE package under ddk/samples/ai_toolchain/horizon_model_train_sample and has the following structure:
└── horizon_model_train_sample
    ├── scripts
    ├── configs
    ├── tools
    └── examples
Pull the Docker image and start a container:
docker pull openexplorer/ai_toolchain_ubuntu_20_j5_cpu:"$version"
nvidia-docker run -it --shm-size="15g" -v `pwd`:/open_explorer openexplorer/ai_toolchain_ubuntu_20_j5_cpu:"$version" /bin/bash
3.1.2 Data packaging
Package the training and validation datasets in lmdb format using the following command:
# pack the training set
python3 tools/datasets/mscoco_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name train --num-workers 10 --pack-type lmdb
# pack the validation set
python3 tools/datasets/mscoco_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name val --num-workers 10 --pack-type lmdb
train_lmdb and val_lmdb are the packaged training and validation datasets; they are the data that the network ultimately reads.
3.2 Floating-point model training
In the configuration file configs/detection/detr/detr_efficientnetb3_mscoco.py, modify the hardware-related settings and the dataset paths as needed, then train the floating-point model with the following command:
python3 tools/train.py --stage float --config configs/detection/detr/detr_efficientnetb3_mscoco.py
3.3 Floating point model validation
Set float_checkpoint_path to the trained checkpoint, then verify the accuracy of the trained model with the following command:
python3 tools/predict.py --stage float --config configs/detection/detr/detr_efficientnetb3_mscoco.py
4 Model quantization and compilation
Before the model can run on the board, it must be compiled into an .hbm file. The compile tool compiles a quantized model into an .hbm file that can run on the board, so the floating-point model must be quantized first. The fixed-point model is obtained through QAT (quantization-aware training) and conversion.
4.1 Quantized model training
After the floating-point model is trained, quantization training can begin. Quantization training is essentially fine-tuning on top of the floating-point model. The specific configuration is defined in qat_trainer in the config. For quantization training, the initial learning rate is set to one-tenth of the floating-point training learning rate, and the number of training epochs is greatly reduced. Start training the fixed-point model by running the following script:
python3 tools/train.py --stage qat --config configs/detection/detr/detr_efficientnetb3_mscoco.py
4.2 Quantized model verification
To verify the accuracy of the quantized model, run the following commands:
python3 tools/predict.py --stage qat --config configs/detection/detr/detr_efficientnetb3_mscoco.py
python3 tools/predict.py --stage int_infer --config configs/detection/detr/detr_efficientnetb3_mscoco.py
The qat stage verifies the model with pseudo-quantization nodes inserted (float32). The int_infer stage verifies the fixed-point model (int8), and its accuracy is the true accuracy of the final int8 model. The two accuracies should be very close.
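The reason the two accuracies are expected to be close can be seen from a minimal sketch of pseudo-quantization. It assumes symmetric per-tensor int8 quantization with a single scale and Python's built-in rounding; the actual quantization scheme of the toolchain may differ, and both function names are hypothetical.

```python
def quantize_int8(x, scale, qmin=-128, qmax=127):
    """Map a float to an int8 code: round to the nearest step, then clip."""
    q = round(x / scale)
    return max(qmin, min(qmax, q))


def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Pseudo-quantization: quantize, then map back to float32 range."""
    return quantize_int8(x, scale, qmin, qmax) * scale


scale = 0.1
print(quantize_int8(0.26, scale))   # 3
print(quantize_int8(100.0, scale))  # 127 (clipped to the int8 range)
```

In this sketch the float value produced by fake_quantize is exactly the int8 code multiplied by the scale, so a model evaluated with pseudo-quantization nodes sees the same rounding and clipping error as the int8 model.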
4.3 Simulated on-board accuracy verification
In addition to the model verification above, an accuracy verification method that matches on-board execution exactly is also provided. Run it as follows:
python3 tools/align_bpu_validation.py --config configs/detection/detr/detr_efficientnetb3_mscoco.py
4.4 Quantized model compilation
After training is complete, the compile tool can be used to compile the quantized model into an .hbm file that can run on the board. The tool can also estimate the running performance on the BPU. Use the following script:
python3 tools/compile_perf.py --config configs/detection/detr/detr_efficientnetb3_mscoco.py --out-dir ./ --opt 3
opt is the optimization level, ranging from 0 to 3. A larger value means a higher optimization level and a longer running time. The compile_perf script generates an .html file and an .hbm file (in the compile file directory): the .html file reports the running performance on the BPU, and the .hbm file is the model file used for on-board measurement.
4.5 On-board performance measurement
Use the hrt_model_exec perf tool to measure the BPU performance of the generated .hbm file on the board. The hrt_model_exec perf parameters are as follows:
hrt_model_exec perf --model_file {model}.hbm \
--thread_num 8 \
--frame_count 2000 \
--core_id 0 \
--profile_path '.'
4.6 Result visualization
To see the effect of the trained model on a single frame, the tools folder also provides a prediction and visualization script. Run the following script:
python3 tools/infer.py --config configs/detection/detr/detr_efficientnetb3_mscoco.py --save_path ./
The infer_cfg field needs to be configured in the config file.
Visual example: