Following the toolchain documentation and tutorial posts from experts in the Horizon community, I successfully deployed YOLOv5 on the X3pi board, but the inference time was over 300 ms. The experts' tutorials cover accelerating the Python post-processing, but optimizing the inference time of the YOLOv5 model itself is rarely covered on the Horizon site.
After consulting with Mr. Xu, I was inspired and my thinking opened up. Flipping through the X3pi manual, I found that it does mention an acceleration method for YOLOv5.
The basic principle is shown in the diagram: the post-processing layer is removed from the model (which is why the YOLOv5 model in the demo has three outputs), and that part is done on the board instead of inside the model, thereby speeding up model inference. The post-processing itself has already been accelerated with C++-packaged code on the site, so it is not elaborated here.
First, the drawbacks: this model modification only works on YOLOv5 v2.0. I tried converting the 2.0 code to a 7.0 model, and modifying version 7.0 directly, but both attempts failed. Compared with the weights of later versions, the v2.0 weight files train more slowly without the later optimizations, and large batch sizes will not fit on the GPU: my main card is an RTX 3060, which can run the x weights at batch size 8 in v7.0, but in v2.0 it can only manage the s weights.
Second, version 2.0 is two years old and has bugs and minor conflicts with current environments.
This article starts with fixing the bugs in YOLOv5 v2.0, then walks through the model modifications that accelerate inference.
Preparation in advance
First, go to the official YOLOv5 repository at https://github.com/ultralytics/yolov5/, select the v2.0 tag, and download the project as an archive.
Download the weights. Weights are not necessarily interchangeable between versions, so you need the v2.0 weights: open Releases in the lower right, click the tags on the new page, and find the v2.0 tag.
Scroll down the new page to find the v2.0 models. I suggest testing the waters with s first, since your graphics card may not handle the larger models.
Unzip the downloaded project and place the weights in the weights folder.
Prepare the training set, modify the yaml file and training parameters, etc.
Enter the yolov5 environment.
Bug fixes
First run
Running train.py resulted in the following error:
This is probably a compatibility issue with the torch version. The solution: go to around line 130 of models/yolo.py and disable gradient calculation there, typically by wrapping the offending block in `with torch.no_grad():` so the in-place initialization no longer tracks gradients.
Second run
Running again, the following error occurred (with the word int)
This error is because np.int was removed in newer versions of the numpy library; using the built-in int directly is sufficient. In VS Code, use find-and-replace across files. To avoid touching statements such as np.int16 or np.interp, replace `(np.int)` with `(int)`.
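If you would rather script the replacement than do it in an editor, a word-boundary regex handles this safely, replacing np.int without touching np.int16 or np.interp (a minimal sketch):

```python
import re

def replace_np_int(source: str) -> str:
    # The trailing \b prevents matching np.int16, np.int32, np.interp, etc.
    return re.sub(r"\bnp\.int\b", "int", source)

print(replace_np_int("x = np.int(3); y = np.int16(3); z = np.interp(1, [0, 2], [0, 4])"))
# x = int(3); y = np.int16(3); z = np.interp(1, [0, 2], [0, 4])
```

Applied over each .py file, this performs the same fix as the editor-based find-and-replace.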
Third run
Running again, an error occurred during epoch 0, containing the word CPU.
This error is caused by the variable a not being on the GPU. Don't panic: go to line 533 of utils/utils.py.
Add code there to move the variable to the GPU, most likely a `.to(device)` call of the form `a = a.to(targets.device)` (illustrative; match the variable names in your copy of utils.py).
Fourth run
The error is as follows:
Solution: Go to line 916 of utils/utils.py
Make the following modifications
For copying:
targets_cpu = []
for sublist in targets:
sublist_cpu = []
for item in sublist:
if isinstance(item, torch.Tensor) and item.is_cuda:
sublist_cpu.append(item.cpu())
else:
sublist_cpu.append(item)
targets_cpu.append(sublist_cpu)
Run the code again and it should now run through; training begins.
One more note: some students may encounter the following error message:
This is because the YOLOv5 v2.0 weights are not optimized compared with later versions and occupy too much memory. However, to accelerate inference we have to train with v2.0, so reduce the batch size or use lighter weights.
When training completes, test the inference.
The results are very good. At this point we have a trained v2.0 model, and we can start modifying its output layers.
Modify the model and convert it to onnx
Simply put, the modification method is as follows. First, copy models/export.py to the outermost directory of the project.
Open export.py from the outermost directory and, in the main function, modify the path to point at the newly trained model. Pay special attention to the img size parameter: it must be a multiple of 32, and be sure to remember the value you use.
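As a side note on the multiple-of-32 requirement, a small helper can round an arbitrary size up to the nearest valid value (illustrative only, not part of export.py; YOLOv5's own check_img_size does something similar):

```python
import math

def make_divisible(img_size: int, stride: int = 32) -> int:
    # Round up to the nearest multiple of the maximum stride (32 for YOLOv5).
    return int(math.ceil(img_size / stride) * stride)

print(make_divisible(640))  # 640 (already a multiple of 32)
print(make_divisible(650))  # 672
```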
Continuing to the ONNX export section starting at line 41: opset_version is the ONNX opset used for the conversion; it defaults to 12 and must be changed to 11, otherwise converting the model to a bin file will fail. In addition, note the input name here, which you will also need later.
Go to the forward method at line 22 of models/yolo.py and, at line 29, comment out the line x[i] = x[i].view(...).contiguous() and replace it with x[i] = x[i].permute(0, 2, 3, 1).contiguous().
After saving the modifications, run export.py directly; the onnx file appears in runs/exp/weights/.
Convert to bin file
Move the ONNX model to a separate folder and enter the toolchain environment (there are many tutorials on deploying and using the toolchain on the site, so the setup is not covered here). Below is the folder structure I used for model conversion, for reference:
- hb_mapper_makertbin.log: log file generated automatically by the toolchain
- imgs_train: original images from the training set
- imgs_cal: the image calibration data required for model conversion
- trans.py: converts the original training images into calibration data
- models_onnx: the onnx model before conversion
- model_output: the converted bin file
- tran.yaml: the config file required for conversion
Validate the model
Use the hb_mapper checker instruction to validate the model.
The input shape shown here is the parameter you were asked to remember earlier.
As can be seen, only a few operators in the new model are not supported by BPU
Prepare image calibration data
For the calibration data conversion I borrowed from the expert Xiaoxixi's work, and the Horizon manual also has a complete walkthrough, so it is not elaborated in this section.
Model conversion
Before converting, a YAML file must be prepared. Mine is below for everyone's reference, and the places that need modification are marked!
model_parameters:
  # ----------------------------------------
  onnx_model: './models_onnx/head.onnx'      # path of the original model
  output_model_file_prefix: 'head_yolov5'    # model name
  # ----------------------------------------
  march: 'bernoulli2'
input_parameters:
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  input_type_rt: 'nv12'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451
  input_layout_rt: 'NHWC'
calibration_parameters:
  # ----------------------------------------
  cal_data_dir: './imgs_cal/head'            # calibration data location
  # ----------------------------------------
  calibration_type: 'max'
  max_percentile: 0.9999
compiler_parameters:
  compile_mode: 'latency'
  optimize_level: 'O3'
  debug: False
  core_num: 2
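One detail worth noting in the YAML above: scale_value is simply 1/255, the standard factor for normalizing 8-bit pixel values into [0, 1]:

```python
# scale_value in the YAML is just 1/255, the standard 8-bit pixel normalization factor.
scale = 1 / 255
assert abs(scale - 0.003921568627451) < 1e-12
print(f"{scale:.15f}")  # 0.003921568627451
```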
Conversion instruction: hb_mapper makertbin --config <your yaml file> --model-type onnx
You can see the bin file model in the generated model_output folder
On-board operation
For convenience, copy Horizon's example /app/ai_inference/07_yolov5_sample directly to your own home directory, and move the bin model onto the board.
Open test_yolov5.py and go to the entry of the main function.
The forward call below is the inference step, and the postprocess after it is the post-processing step. Add performance timing, change model_hw_shape to your own model's size, and remember to modify the model path and image path as well.
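A minimal way to add the performance test is to time the forward call with time.perf_counter. This is a generic sketch; run_forward here is a hypothetical stand-in for the demo's actual inference call:

```python
import time

def run_forward():
    # Hypothetical stand-in for the board's inference call in test_yolov5.py;
    # the sleep simulates the model taking some time to run.
    time.sleep(0.01)
    return []

t0 = time.perf_counter()
outputs = run_forward()
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"inference time: {elapsed_ms:.1f} ms")
```

Wrapping only the forward call (and, separately, the postprocess call) makes it easy to see which stage dominates.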
Next, open postprocess.py and change the num_class in line 21 to the number of classes in your own model
Next, look at the three reshapes on line 35.
The second and third parameters of each reshape are your model size divided by 8, 16, and 32 respectively, and the fifth parameter needs to be changed to the num_class + 5 from earlier.
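To sanity-check those numbers, the three reshape shapes can be computed from the input size and class count. This sketch assumes a hypothetical 672x672 single-class model; substitute your own values:

```python
def yolov5_output_shapes(img_size: int, num_classes: int):
    # One output branch per stride; each grid cell predicts 3 anchors
    # of (num_classes + 5) values (box xywh + objectness + class scores).
    return [(1, img_size // s, img_size // s, 3, num_classes + 5) for s in (8, 16, 32)]

for shape in yolov5_output_shapes(672, 1):
    print(shape)
# (1, 84, 84, 3, 6)
# (1, 42, 42, 3, 6)
# (1, 21, 21, 3, 6)
```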
With the modifications complete, run test_yolov5.py.
The inference time is about 42 ms, very close to the PC version, but the model accuracy is low and the post-processing time is still relatively long. Accuracy calibration of the model is covered in the toolchain manual for your reference. As for post-processing, I highly recommend reading Xiao Xixi's article, which uses Cython to accelerate post-processing with outstanding results.
Summary
I have been using the X3pi for over half a year now and have started entering some competitions with it. As a domestic product, it is genuinely excellent. Although using the BPU is troublesome, with help from Horizon's technical staff (many thanks, Mr. Xu!), the manual (it must be said that the X3pi manual is extremely comprehensive), tutorials, and other resources, you will find that the BPU's compute for model inference is really powerful, and after tuning there is a real sense of national pride (manual dog head). I look forward to more products from Horizon.