[Reference Algorithm] Horizon DETR Reference Algorithm-v1.2.2

0 Overview


DETR uses a Transformer-based encoder-decoder architecture that treats object detection as a direct set prediction problem, which simplifies the training pipeline. Given a fixed set of learned object queries, DETR reasons about the relations between the objects and the global image context and outputs the final set of predictions directly, in parallel. The DETR model is conceptually simple and does not require specialized libraries. In addition, DETR can be easily generalized to produce panoptic segmentation in a unified manner. This document introduces the DETR object detection algorithm and explains how to use it.

1 Performance and accuracy indicators

dataset   input_shape         backbone          FPS (dual-core)   floating-point mAP   quantized mAP
mscoco    [1, 3, 800, 1332]   efficientnet-b3   62.29             37.21                35.97
mscoco    [1, 3, 800, 1332]   resnet50          47.32             35.69                31.34

2 Model optimization point description

  • resnet50: Compared with the official version, the dynamic input shape is replaced with a fixed shape during inference.
  • efficientnet-b3: The training strategy is optimized, and the data augmentation differs from the public version.
  • Transformer: Optimized by adjusting tensor dimensions (3D → 4D) and the op execution order to suit BPU characteristics and improve performance; the logic is consistent with the public version.

3 Model introduction

Model structure


DETR consists of the following four parts:

  • CNN backbone: learns 2D representations of the input image; resnet50 and efficientnet-b3 are available.
  • Transformer encoder: learns the global information of the image.
  • Transformer decoder: decodes the output of the encoder.
  • Head: outputs the prediction results through an FFN.

backbone

The backbone is a CNN that extracts 2D image features. The DETR model provides two backbone options, resnet50 and efficientnet-b3, each with a downsampling factor of 32.

  • Input shape: [1, 3, 800, 1332]
  • Output shape: efficientnet-b3: [1, 384, 25, 42]; resnet50: [1, 2048, 25, 42]
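
The spatial size of the output follows from the downsampling factor of 32: 800 / 32 = 25 and 1332 / 32 = 41.625, which becomes 42. A minimal sketch of that arithmetic is shown below. The helper name `feature_map_shape` is hypothetical, and the round-up behavior is an assumption inferred from the shapes listed above, not taken from the backbone code.

```python
import math


def feature_map_shape(input_shape, out_channels, stride=32):
    """Return [N, C, H, W] of the backbone output for a given input shape."""
    n, _, h, w = input_shape
    # Assumption: spatial dims are rounded up, which matches 1332 / 32 -> 42.
    return [n, out_channels, math.ceil(h / stride), math.ceil(w / stride)]


print(feature_map_shape([1, 3, 800, 1332], 384))   # efficientnet-b3
print(feature_map_shape([1, 3, 800, 1332], 2048))  # resnet50
```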

resnet50

Code: hat/models/backbones/resnet.py. The implementation is consistent with the public version.

efficientnet-b3

Code: hat/models/backbones/efficientnet.py. This is a network structure that Horizon supports efficiently. The suffix after "b" distinguishes the EfficientNet variants, which correspond to different resolutions and network depths. For b3, [width_coefficient, depth_coefficient, default_resolution, dropout_rate] is set to (1.2, 1.4, 300, 0.3).
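
To illustrate how the width and depth coefficients act, the sketch below applies the rounding rules used by the public EfficientNet reference implementation to the b3 values above. These are assumptions: the rounding rules may differ from those in hat/models/backbones/efficientnet.py, and the baseline values (320 channels in the final stage, 2 repeats in a block) come from the public EfficientNet-B0 definition, not from this document. Under these assumptions the final stage scales to 384 channels, which agrees with the efficientnet-b3 output shape listed earlier.

```python
import math


def round_filters(filters, width_coefficient, divisor=8):
    """Scale a channel count and round it to a multiple of `divisor`."""
    scaled = filters * width_coefficient
    new_filters = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    # Never round down by more than 10 percent.
    if new_filters < 0.9 * scaled:
        new_filters += divisor
    return int(new_filters)


def round_repeats(repeats, depth_coefficient):
    """Scale the number of block repeats, rounding up."""
    return int(math.ceil(depth_coefficient * repeats))


width, depth, resolution, dropout = (1.2, 1.4, 300, 0.3)  # b3 values from the text
print(round_filters(320, width))  # channels of the final stage after scaling
print(round_repeats(2, depth))    # repeats of a 2-layer stage after scaling
```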

transformer

  1. Use 1x1 conv to reduce the number of feature channels from high dimension to 256

    self.input_proj = nn.Conv2d(
        self.in_channels, self.embed_dims, kernel_size=1
    )
    x = self.input_proj(x)

  2. Create a padding mask at the input image resolution and interpolate it to the feature-map resolution: [B, H, W] → [B, feat_h, feat_w].

    masks = torch.zeros(
        (batch_size, input_img_h, input_img_w), device=x.device
    )
    # interpolate masks to have the same spatial shape as x
    masks = (
        F.interpolate(masks.unsqueeze(1), size=x.shape[-2:])
        .to(torch.bool)
        .squeeze(1)
    )
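
As a rough illustration of what the interpolation step does to a padding mask, the pure-Python sketch below resizes a small boolean mask with nearest-neighbor sampling. It only approximates the default nearest mode of F.interpolate, and the helper name `downsample_mask_nearest` is hypothetical. In the code above the mask is created with torch.zeros, so every feature-map cell ends up marked as valid (False).

```python
def downsample_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D boolean mask given as a list of lists."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [bool(mask[(i * in_h) // out_h][(j * in_w) // out_w]) for j in range(out_w)]
        for i in range(out_h)
    ]


# 4x6 "image" whose last two columns are padding (True = padded pixel).
img_mask = [[col >= 4 for col in range(6)] for _ in range(4)]
feat_mask = downsample_mask_nearest(img_mask, 2, 3)
print(feat_mask)

# An all-zero mask, as in the fixed-shape code above, stays all False.
zero_mask = [[False] * 6 for _ in range(4)]
print(downsample_mask_nearest(zero_mask, 2, 3))
```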

  3. Position encoding. Code: hat/models/embeddings.py

    class PositionEmbeddingSine(nn.Module):
        ...
        def forward(self, mask):
            not_mask = ~mask
            y_embed = not_mask.cumsum(1, dtype=torch.float32)
            x_embed = not_mask.cumsum(2, dtype=torch.float32)
            if self.normalize:
                eps = 1e-6
                y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
                x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale

            dim_t = torch.arange(
                self.num_pos_feats, dtype=torch.float32, device=mask.device
            )
            dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

            pos_x = x_embed[:, :, :, None] / dim_t
            pos_y = y_embed[:, :, :, None] / dim_t
            pos_x = torch.stack(
                (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4
            ).flatten(3)
            pos_y = torch.stack(
                (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4
            ).flatten(3)
            pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
            return pos
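
The per-axis computation in the forward pass above can be sketched for a single row in pure Python: valid cells are counted cumulatively, normalized, and then mapped to interleaved sin/cos channels. The default values used here (num_pos_feats=8, temperature=10000, scale=2π) are assumptions chosen for illustration; the values actually configured in hat are not stated in this document.

```python
import math


def normalized_positions(valid_row, scale=2 * math.pi, eps=1e-6):
    """Cumulative count of valid cells along one axis, normalized to [0, scale]."""
    cumsum, total = [], 0
    for valid in valid_row:
        total += int(valid)
        cumsum.append(total)
    return [c / (cumsum[-1] + eps) * scale for c in cumsum]


def sine_embedding(position, num_pos_feats=8, temperature=10000):
    """Embed one scalar position: even channels use sin, odd channels use cos."""
    out = []
    for i in range(num_pos_feats):
        dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
        angle = position / dim_t
        out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return out


xs = normalized_positions([True, True, True, True])
print([round(x, 3) for x in xs])
print([round(v, 3) for v in sine_embedding(xs[0])])
```

Each pair of channels shares one frequency, so low channel indices vary quickly with position and high channel indices vary slowly.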
    
  4. Transformer encoder-decoder. Code: hat/models/task_modules/detr/transformer.py

    class Transformer(nn.Module):
        ...
        # Excerpt from the forward pass:
        bs, c, h, w = x.shape
        query_embed = (
            query_embed.transpose(0, 1)
            .unsqueeze(0)
            .repeat(bs, 1, 1)
            .contiguous()
            .view(bs, query_embed.shape[1], 2, query_embed.shape[0] // 2)
        )  # [num_query, dim] → [bs, dim, 2, num_query/2]
        mask = mask.flatten(1)  # [bs, h, w] → [bs, h*w]

        tgt = torch.zeros_like(query_embed)  # [bs, dim, 2, num_query/2]
        # tgt = torch.zeros(query_embed.size(), device=query_embed.device)
        tgt = self.tgt_quant(tgt)
        memory = self.encoder(x, src_key_padding_mask=mask, pos=pos_embed)
        hs = self.decoder(
            tgt,
            memory,
            memory_key_padding_mask=mask,
            pos=pos_embed,
            query_pos=query_embed,
        )  # [nb_dec, bs, dim, 2, num_query/2]
        hs = (
            hs.contiguous()
            .view(
                hs.shape[0],
                hs.shape[1],
                hs.shape[2],
                hs.shape[3] * hs.shape[4],
            )
            .permute(0, 1, 3, 2)
        )  # [nb_dec, bs, num_query, embed_dim]
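
The 4-dimensional layout used above places the queries on a [2, num_query/2] grid so that they look like a small feature map. Because view() on a contiguous tensor keeps row-major order, the mapping between a flat query index and its grid slot can be sketched as follows. The helper names are hypothetical, and num_query = 100 is an assumption used only for illustration.

```python
def query_to_grid(q, num_query):
    """Map a flat query index to its (row, col) slot in the [2, num_query/2] layout."""
    cols = num_query // 2
    return q // cols, q % cols


def grid_to_query(row, col, num_query):
    """Inverse mapping, matching the final view back to a flat query axis."""
    return row * (num_query // 2) + col


num_query = 100  # assumption for illustration
print(query_to_grid(0, num_query))
print(query_to_grid(49, num_query))
print(query_to_grid(50, num_query))
```

The round trip is lossless, which is why the decoder output can be reshaped back to [nb_dec, bs, num_query, embed_dim] without reordering the queries.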
    

TransformerEncoderLayer: MultiheadAttention + FFN, 6 layers

class TransformerEncoderLayer(nn.Module):
    ...
    def forward_post(
        self,
        src,
        src_mask: Optional[Tensor] = None,
        src_key_padding_mask: Optional[Tensor] = None,
        pos: Optional[Tensor] = None,
    ):
        # Add the position embedding to Q and K; output shape: [bs, c, h, w]
        q = k = self.with_pos_embed(src, pos)  
        #MultiheadAttention
        src2 = self.self_attn(
            q,
            k,
            value=src,
            attn_mask=src_mask,
            key_padding_mask=src_key_padding_mask,
        )[
            0
        ]  # [bs, embed_dim, h, w]
        #shortcut and norm
        src = self.dropout1_add.add(src, self.dropout1(src2))
        src = self.norm1(src)
        #FFN
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = self.dropout2_add.add(src, self.dropout2(src2))
        src = self.norm2(src)  # [bs, c, h, w]
        return src

TransformerDecoderLayer: MultiheadAttention + MultiheadAttention + FFN, 6 layers

class TransformerDecoderLayer(nn.Module):
    ...
    def forward_post(
        self,
        tgt,
        memory,
        tgt_mask: Optional[Tensor] = None,
        memory_mask: Optional[Tensor] = None,
        tgt_key_padding_mask: Optional[Tensor] = None,
        memory_key_padding_mask: Optional[Tensor] = None,
        pos: Optional[Tensor] = None,
        query_pos: Optional[Tensor] = None,
    ):
        ...
        # The two attention modules are created in __init__ (shown for reference):
        #   self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
        #   self.multihead_attn = MultiheadAttention(
        #       d_model, nhead, dropout=dropout
        #   )
        q = k = self.with_pos_embed(
            tgt, query_pos
        )  # [bs, dim, 2, num_query/2]
        # Self-attention over the object queries
        tgt2 = self.self_attn(
            q,
            k,
            value=tgt,
            attn_mask=tgt_mask,
            key_padding_mask=tgt_key_padding_mask,
        )[
            0
        ]  # [bs, dim, 2, num_query/2]
        tgt = self.dropout1_add.add(tgt, self.dropout1(tgt2))
        tgt = self.norm1(tgt)
        # Cross-attention between the queries and the encoder memory
        tgt2 = self.multihead_attn(
            query=self.with_pos_embed(tgt, query_pos),
            key=self.with_pos_embed(memory, pos),
            value=memory,
            attn_mask=memory_mask,
            key_padding_mask=memory_key_padding_mask,
        )[
            0
        ]  # [bs, dim, 2, num_query/2]
        #shortcut and norm
        tgt = self.dropout2_add.add(tgt, self.dropout2(tgt2))
        tgt = self.norm2(tgt)
        #FFN
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = self.dropout3_add.add(tgt, self.dropout3(tgt2))
        tgt = self.norm3(tgt)
        return tgt  # [bs, dim, 2, num_query/2]

MultiheadAttention: 4-dimensional attention layer

class MultiheadAttention(nn.Module):
    ...
    def forward(
        self,
        query: Tensor,
        key: Tensor,
        value: Tensor,
        key_padding_mask: Optional[Tensor] = None,
        attn_mask: Optional[Tensor] = None,
    ):
        # set up shape vars
        bsz, embed_dim, tgt_h, tgt_w = query.shape
        _, _, src_h, src_w = key.shape
        ...
        #q k v projection layers
        q = self.q_proj(query)
        k = self.k_proj(key)
        v = self.v_proj(value)

        q = (
            q.contiguous()
            .view(bsz * self.num_heads, self.head_dim, tgt_h, tgt_w)
            .permute(0, 2, 3, 1)
        )
        k = k.contiguous().view(
            bsz * self.num_heads, 1, self.head_dim, src_h * src_w
        )
        v = (
            v.contiguous()
            .view(bsz * self.num_heads, 1, self.head_dim, src_h * src_w)
            .permute(0, 1, 3, 2)
        )

        # update source sequence length after adjustments
        src_len = k.size(3)

        # merge key padding and attention masks
        if key_padding_mask is not None:
            ...
            key_padding_mask = (
                key_padding_mask.view(bsz, 1, 1, src_len)
                .expand(-1, self.num_heads, -1, -1)
                .reshape(bsz * self.num_heads, 1, src_len)
            )
            if attn_mask is None:
                attn_mask = key_padding_mask
            elif attn_mask.dtype == torch.bool:
                attn_mask = attn_mask.logical_or(key_padding_mask)
            else:
                attn_mask = attn_mask.masked_fill(
                    key_padding_mask, float("-100")
                )

        if attn_mask is not None and attn_mask.dtype == torch.bool:
            new_attn_mask = torch.zeros_like(attn_mask, dtype=torch.float)
            new_attn_mask.masked_fill_(attn_mask, float("-100"))
            attn_mask = new_attn_mask
        if attn_mask is not None:
            attn_mask = self.attn_mask_quant(attn_mask)

        # q = q * self.scale
        # attention = (q @ k.transpose(-2, -1))
        scale = self.scale_quant(self.scale)
        q = self.mul.mul(
            q, scale
        )  # [bsz*self.num_heads, tgt_h, tgt_w, self.head_dim]

        attention = self.matmul.matmul(
            q, k
        )  # [bsz*self.num_heads, tgt_h, tgt_w, src_h*src_w]

        if attn_mask is not None:
            # attention = attention + mask
            attn_mask = attn_mask.contiguous().unsqueeze(1)
            attention = self.mask_add.add(attention, attn_mask)
            attention = self.softmax(attention)
        else:
            attention = self.softmax(attention)

        attention = self.attention_drop(attention)
        # output = (attention @ v)
        attn_output = self.attn_matmul.matmul(
            attention, v
        )  # [bsz*self.num_heads, tgt_h, tgt_w, self.head_dim]
        attn_output = (
            attn_output.permute(0, 3, 1, 2)
            .contiguous()
            .view(bsz, embed_dim, tgt_h, tgt_w)
        )
        attn_output = self.out_proj(attn_output)

        return attn_output, None
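
The core of the layer above is scaled dot-product attention with an additive mask, where padded keys receive an offset of -100 rather than negative infinity. The pure-Python sketch below shows this for a single query and a single head. It is a simplified illustration, not the 4D implementation; the scale of 1/sqrt(head_dim) is an assumption about self.scale, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, key_padding_mask=None, mask_value=-100.0):
    """Single-query scaled dot-product attention with an additive padding mask."""
    scale = 1.0 / math.sqrt(len(q))  # assumption: scale = head_dim ** -0.5
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    if key_padding_mask is not None:
        # Padded keys get a large negative offset, as in the code above.
        scores = [
            s + (mask_value if pad else 0.0)
            for s, pad in zip(scores, key_padding_mask)
        ]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, weights = attend(q, keys, values, key_padding_mask=[False, False, True])
print([round(w, 4) for w in weights])
print([round(o, 4) for o in out])
```

With an offset of -100 the masked key receives a weight on the order of exp(-100), so it contributes essentially nothing to the output while every value stays finite.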

head

The head consists of two branches, one for classification and one for box regression. The classification branch is a single linear layer; the regression branch is MLP + ReLU + linear. The FFN predicts each box as its center coordinates, height, and width, normalized relative to the input image.

self.fc_cls = nn.Linear(self.embed_dims, self.cls_out_channels)
self.fc_reg = nn.Linear(self.embed_dims, 4)
self.activate = nn.ReLU(inplace=True)
self.reg_ffn = MlpModule2d(
    self.embed_dims,
    self.embed_dims,
    self.embed_dims,
    act_layer=self.act_layer,
    drop_ratio=0.0,
)
# forward pass
outputs_class = self.fc_cls(outs_dec)
outputs_coord = self.fc_reg(self.activate(self.reg_ffn(outs_dec)))
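
Since the regression branch predicts boxes as normalized center coordinates, width, and height, a post-processing step has to map them back to pixel corners. A minimal sketch of that conversion follows. The function name `cxcywh_to_xyxy` is hypothetical, and the sketch assumes the predicted values are already in the range [0, 1]; how hat performs this step is not shown in this document.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return (x1, y1, x2, y2)


# A box centered in a 1332x800 image and covering half of each side.
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), img_w=1332, img_h=800))
```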

3 Floating-point training

3.1 Before you start

3.1.1 Environment Deployment

The DETR sample is located in the OE package under ‘ddk/samples/ai_toolchain/horizon_model_train_sample’ and has the following structure:

└── horizon_model_train_sample
    └── scripts
        ├── configs
        ├── tools
        └── examples

docker:

docker pull openexplorer/ai_toolchain_ubuntu_20_j5_cpu:"$version"
nvidia-docker run -it --shm-size="15g" -v `pwd`:/open_explorer openexplorer/ai_toolchain_ubuntu_20_j5_cpu:"$version" /bin/bash

3.1.2 Data packaging

Package the training and validation datasets in lmdb format using the following command:

# pack the training set
python3 tools/datasets/mscoco_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name train --num-workers 10 --pack-type lmdb
# pack the validation set
python3 tools/datasets/mscoco_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name val --num-workers 10 --pack-type lmdb

train_lmdb and val_lmdb are the packaged training data set and validation data set, which are also the final data read by the network.

3.2 Floating-point model training

Edit the configuration in configs/detection/detr/detr_efficientnetb3_mscoco.py to match your hardware and dataset paths, then train the floating-point model with the following command:

python3 tools/train.py --stage float --config configs/detection/detr/detr_efficientnetb3_mscoco.py

3.3 Floating point model validation

Set float_checkpoint_path to the trained checkpoint, then verify the accuracy of the trained model with the following command:

python3 tools/predict.py --stage float --config configs/detection/detr/detr_efficientnetb3_mscoco.py

4 Model quantization and compilation

Before the model can be deployed on the board, it must be compiled into an .hbm file. The compile tool compiles a quantized model into an hbm file that can run on the board, so the floating-point model must be quantized first. The fixed-point model is obtained through quantization-aware training (QAT) followed by conversion.

4.1 Quantized model training

After the floating-point model is trained, quantization training can begin. Quantization training is essentially fine-tuning on top of the floating-point model. The specific configuration is defined in qat_trainer in the config file. For quantization training, the initial learning rate is set to one-tenth of the floating-point learning rate, and the number of training epochs is greatly reduced. Start training the fixed-point model with the following command:

python3 tools/train.py --stage qat --config configs/detection/detr/detr_efficientnetb3_mscoco.py

4.2 Quantized model validation

To verify the accuracy of the quantized model, run the following commands:

python3 tools/predict.py --stage qat --config configs/detection/detr/detr_efficientnetb3_mscoco.py

python3 tools/predict.py --stage int_infer --config configs/detection/detr/detr_efficientnetb3_mscoco.py

The qat stage validates the model with pseudo-quantization (fake-quant) nodes inserted, which still runs in float32. The int_infer stage validates the fixed-point (int8) model, and its accuracy is the true accuracy of the final int8 model. The two accuracies should be very close.
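
The difference between the two validation targets can be illustrated with a generic symmetric int8 scheme: a pseudo-quantization node rounds a value onto the int8 grid but returns a float, whereas the fixed-point model stores the integer itself. This sketch is an assumption-laden illustration only; the scale value and function names are hypothetical, and the quantization scheme actually used by the Horizon toolchain may differ.

```python
def quantize_int8(x, scale):
    """Round to the nearest int8 step and clamp to [-128, 127]."""
    return max(-128, min(127, round(x / scale)))


def fake_quantize(x, scale):
    """Quantize then dequantize: the result is a float that lies on the int8 grid."""
    return quantize_int8(x, scale) * scale


scale = 0.05  # hypothetical per-tensor scale
x = 1.234
q = quantize_int8(x, scale)   # integer seen by the fixed-point model
fq = fake_quantize(x, scale)  # float seen by the model with fake-quant nodes
print(q, round(fq, 4))
```

Because both paths land on the same grid points, the accuracy measured with pseudo-quantization nodes is expected to track the accuracy of the int8 model closely.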

4.3 Simulated on-board accuracy validation

In addition to the model validation above, an accuracy validation method that is fully consistent with on-board execution is also provided:

python3 tools/align_bpu_validation.py --config configs/detection/detr/detr_efficientnetb3_mscoco.py

4.4 Quantized model compilation

After training is complete, the compile tool can be used to compile the quantized model into an hbm file that can run on the board. The tool can also estimate the running performance on the BPU. Use the following script:

python3 tools/compile_perf.py --config configs/detection/detr/detr_efficientnetb3_mscoco.py --out-dir ./ --opt 3

opt sets the optimization level, from 0 to 3. The larger the value, the higher the optimization level and the longer the compilation takes. The compile_perf script generates an .html file and an .hbm file in the compile output directory: the .html file reports the estimated running performance on the BPU, and the .hbm file is the model to run on the board.

4.5 On-board performance measurement

Use the hrt_model_exec perf tool to measure the BPU performance of the generated .hbm file on the board. The hrt_model_exec perf parameters are as follows:

hrt_model_exec perf --model_file {model}.hbm \
                    --thread_num 8 \
                    --frame_count 2000 \
                    --core_id 0 \
                    --profile_path '.'

4.6 Result visualization

If you want to see how the trained model performs on a single frame, the tools folder also provides a prediction and visualization script. Run the following command:

python3 tools/infer.py --config configs/detection/detr/detr_efficientnetb3_mscoco.py --save_path ./

The infer_cfg field needs to be configured in the config file.

Visual example: