Multi-model Batch Inference

D-Robotics · July 31, 2023, 6:39am

1 Preface

Every time an inference task is executed, the underlying system has to respond to an interrupt. If each small model is placed in its own inference task, interrupts become more frequent and the total time increases. Binding multiple small models into a single inference task reduces the number of low-level interrupts, which lowers system overhead and shortens the total time. Both XJ3 and J5 support sequentially invoking multiple small models on multiple pieces of data within a single inference task. Note, however, that the predictions are written to memory only after all models in the inference task have finished. In other words, the user cannot obtain the results of some models before the whole inference task has completed. In addition, this inference method does not support loading the same model repeatedly; if you want one model to run inference on multiple images at the same time, consider using Batch mode instead. For reference, see the related community articles.
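The overhead argument above can be made concrete with a very simple cost model. The sketch below is only an illustration under stated assumptions: every inference task is assumed to pay one fixed per-task overhead (standing in for the interrupt handling), the function name estimated_total_us is hypothetical, and any numbers passed to it are made up rather than measured on XJ3 or J5.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Simplified cost model (an assumption, not a measurement):
//   total = sum of per-model compute time + (number of tasks) * per-task overhead
// With single_task == true all models share one task and pay the overhead once;
// otherwise each model runs in its own task and pays it separately.
double estimated_total_us(const std::vector<double>& compute_us,
                          double per_task_overhead_us,
                          bool single_task) {
  const double compute =
      std::accumulate(compute_us.begin(), compute_us.end(), 0.0);
  double task_count = static_cast<double>(compute_us.size());
  if (single_task && !compute_us.empty()) {
    task_count = 1.0;  // all models are bound into one inference task
  }
  return compute + task_count * per_task_overhead_us;
}
```

In this model, with two models and a per-task overhead of t, separate tasks pay 2t while a single task pays t. The saving grows with the number of small models, which is why the approach matters most when the models themselves are short.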

2 Example introduction

The ddk/samples/ai_toolchain/horizon_runtime_sample directory of the OE package contains a large number of basic examples for on-board deployment. The file structure of the directory is as follows:

+---horizon_runtime_sample
├── code                        
│   ├── 00_quick_start          
│   ├── 01_api_tutorial         
│   ├── 02_advanced_samples
│   │   ├── custom_identity
│   │   ├── multi_input
│   │   ├── multi_model_batch
│   │   └── nv12_batch    
│   ├── 03_misc                 
│   ├── build_j5.sh             
│   ├── build_x86.sh            
│   ├── CMakeLists.txt
│   ├── CMakeLists_x86.txt
│   └── deps_gcc9.3             
├── j5
│   ├── data                    
│   ├── model
│   ├── script                  
│   └── script_x86              
└── README.md

The code folder contains the sample C++ code and build-related files. The j5 folder contains the sample scripts and the executables generated by compilation, along with preset data and models. The scripts in the script directory can be run on the development board to execute the corresponding model inference examples. The example covered in this article is multi_model_batch, which loads the googlenet_224x224_nv12.bin and mobilenetv2_224x224_nv12.bin classification models and reads two jpg images. Forward inference for both models is performed in a single inference task, and the Top1 classification result of each of the two images is obtained after post-processing.

Before studying the code, developers are expected to be familiar with the on-board deployment API provided by Horizon. For this, see the BPU SDK API chapter of the Toolchain manual. That chapter not only describes the API in detail but also covers the data types and data structures related to on-board deployment, the data layout and alignment rules, error codes, and so on. You can also read the sample code with the API manual open alongside it. In addition, developers who are new to on-board deployment with the toolchain are advised to first read "[Model Inference] Quick Start" (https://developer.horizon.cc/forumDetail/174216099150358528), which gives a detailed analysis of the 00_quick_start sample code in horizon_runtime_sample. The code structure of multi_model_batch is similar to that of 00_quick_start.

3 Core code interpretation

// get model handle
  hbDNNHandle_t dnn_handle_googlenet;
  hbDNNHandle_t dnn_handle_mobilenetv2;
  ......
  
  // read input file and convert img to nv12 format
  cv::Mat nv12_mat_googlenet;
  cv::Mat nv12_mat_mobilenetv2;
  ......
  
  // prepare input tensor
  hbDNNTensor input_tensor_googlenet;
  hbDNNTensor input_tensor_mobilenetv2;
  ......
  
  // prepare output tensor
  hbDNNTensor *output_tensor_googlenet = new hbDNNTensor();
  hbDNNTensor *output_tensor_mobilenetv2 = new hbDNNTensor();
  ......

Because the example calls two models, two model handles must be defined. The program reads two images and writes each one into the memory of the corresponding model's input tensor, and each model needs its own memory for its input and output tensors.
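To give a feel for what "convert img to nv12 format" involves and how large each input buffer is, here is a minimal sketch in plain C++. It assumes the image has already been converted to planar I420 (for example with OpenCV), that width and height are even, and that rows are tightly packed; the helper names nv12_size and i420_to_nv12 are hypothetical and are not taken from the sample, and the real runtime may require aligned strides as described in the API manual.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Byte size of a tightly packed NV12 image: a full-resolution Y plane
// followed by one half-resolution plane of interleaved U/V samples.
// Assumption: even dimensions and no row padding.
std::size_t nv12_size(std::size_t width, std::size_t height) {
  if (width % 2 != 0 || height % 2 != 0) {
    throw std::invalid_argument("NV12 needs even width and height");
  }
  return width * height * 3 / 2;
}

// Hypothetical helper: repack planar I420 (Y plane, U plane, V plane)
// into NV12 (Y plane, then U and V interleaved as UVUV...).
std::vector<std::uint8_t> i420_to_nv12(const std::vector<std::uint8_t>& i420,
                                       std::size_t width,
                                       std::size_t height) {
  if (i420.size() != nv12_size(width, height)) {
    throw std::invalid_argument("unexpected I420 buffer size");
  }
  const std::size_t y_size = width * height;
  const std::size_t chroma_size = y_size / 4;  // samples in each of U and V

  std::vector<std::uint8_t> nv12(i420.size());
  // The Y plane is identical in both layouts.
  std::copy(i420.begin(),
            i420.begin() + static_cast<std::ptrdiff_t>(y_size),
            nv12.begin());

  const std::uint8_t* u = i420.data() + y_size;
  const std::uint8_t* v = u + chroma_size;
  std::uint8_t* uv = nv12.data() + y_size;
  for (std::size_t i = 0; i < chroma_size; ++i) {
    uv[2 * i] = u[i];      // U sample first
    uv[2 * i + 1] = v[i];  // then the matching V sample
  }
  return nv12;
}
```

Under these assumptions a 224x224 NV12 input occupies 224 * 224 * 3 / 2 = 75264 bytes, and in this example each of the two models needs its own buffer of that size.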

// Run inference
  hbDNNTaskHandle_t task_handle = nullptr;
  hbDNNInferCtrlParam infer_ctrl_param;
  HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
  // submit first model task
  infer_ctrl_param.more = 1;
  HB_CHECK_SUCCESS(hbDNNInfer(&task_handle,
                              &output_tensor_googlenet,
                              &input_tensor_googlenet,
                              dnn_handle_googlenet,
                              &infer_ctrl_param),
                   "hbDNNInfer failed");
  // submit second model task
  infer_ctrl_param.more = 0;
  HB_CHECK_SUCCESS(hbDNNInfer(&task_handle,
                              &output_tensor_mobilenetv2,
                              &input_tensor_mobilenetv2,
                              dnn_handle_mobilenetv2,
                              &infer_ctrl_param),
                   "hbDNNInfer failed");
  VLOG(EXAMPLE_DEBUG) << "infer success";
  // wait task done
  HB_CHECK_SUCCESS(hbDNNWaitTaskDone(task_handle, 0),
                   "hbDNNWaitTaskDone failed");
  VLOG(EXAMPLE_DEBUG) << "task done";

Because two models are inferred in a single task, the hbDNNInfer interface is invoked twice. The difference between the two calls is the more field of the inference control parameter: **more is set to 0 for the last call and to 1 for every call before it**. The more field indicates whether another model follows the one being submitted: set it to 1 if one follows and to 0 if none does. The two calls share one task_handle and therefore belong to the same task. After both models have been submitted, hbDNNWaitTaskDone is called; once it returns successfully, the inference results have been written to the memory of the output tensors.
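The same pattern extends to more than two models: every submission except the last sets more to 1. The sketch below illustrates only that flag logic with mock types, so it runs without the SDK. MockTask, mock_submit and submit_all are hypothetical names standing in for the task handle and the hbDNNInfer calls, and the rule that a closed task rejects further submissions is an assumption of the mock, not documented SDK behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Mock of a task that collects the models submitted to it.
struct MockTask {
  std::vector<std::string> models;  // models bound to this task, in order
  bool closed = false;              // true once a submission used more == 0
};

// Mock of one submission. `more` mirrors the field of the same name in the
// inference control parameter: 1 = another model follows, 0 = last model.
bool mock_submit(MockTask& task, const std::string& model, int more) {
  if (task.closed) {
    return false;  // mock-only rule: nothing can be appended after the last model
  }
  task.models.push_back(model);
  task.closed = (more == 0);
  return true;
}

// Submit every model to the same task, setting more = 1 for all but the last.
bool submit_all(MockTask& task, const std::vector<std::string>& models) {
  for (std::size_t i = 0; i < models.size(); ++i) {
    const int more = (i + 1 < models.size()) ? 1 : 0;
    if (!mock_submit(task, models[i], more)) {
      return false;
    }
  }
  return task.closed;  // false for an empty list: no task was formed
}
```

Deriving more from the loop index keeps the "last call sets 0" rule in one place, which avoids leaving a stale 1 on the final submission when the list of models changes.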

// post process
  get_topk_result(output_tensor_googlenet, top_k_cls, 1);
  ......
  get_topk_result(output_tensor_mobilenetv2, top_k_cls, 1);
  ......
  
  // release task handle
  HB_CHECK_SUCCESS(hbDNNReleaseTask(task_handle), "hbDNNReleaseTask failed");

After the inference results have been written to the output memory, TopK post-processing is performed twice to obtain the classification predictions for the two images. The task handle is then released to end the current task.
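The idea behind TopK post-processing can be sketched as follows. This is not the sample's get_topk_result implementation; top_k_indices is a hypothetical helper, and it assumes the model output has already been read into a vector of float scores indexed by class id.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the class ids of the k highest scores, best first.
// Ties are broken in favor of the lower class id.
std::vector<std::size_t> top_k_indices(const std::vector<float>& scores,
                                       std::size_t k) {
  std::vector<std::size_t> idx(scores.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, 2, ...
  k = std::min(k, idx.size());                        // clamp k to the class count
  std::partial_sort(idx.begin(),
                    idx.begin() + static_cast<std::ptrdiff_t>(k),
                    idx.end(),
                    [&scores](std::size_t a, std::size_t b) {
                      if (scores[a] != scores[b]) {
                        return scores[a] > scores[b];  // higher score first
                      }
                      return a < b;
                    });
  idx.resize(k);
  return idx;
}
```

With k = 1, as in the calls shown above, the result is the single best class id for each image. std::partial_sort orders only the first k entries, which is cheaper than a full sort when k is small compared with the number of classes.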

4 Running on the board

Running this example on the board is straightforward. First execute the build_j5.sh script in the code folder, which generates the files and dependencies in the j5 folder. Inside the j5 folder, data stores the input data used for model inference, model stores the models for each example, and script contains, in addition to the run scripts, the dynamic link library (.so) files and executables produced by compilation. Copy the entire j5 folder to the board, go to the j5/script/02_advanced_samples directory, and run the run_multi_model_batch.sh script to execute the multi-model batch inference sample on the development board. Running this example on a J5 development board prints the following to the terminal:

root@j5dvb-hynix8G:/userdata/chaoliang/j5/script/02_advanced_samples# sh run_multi_model_batch.sh
../aarch64/bin/run_multi_model_batch --model_file=../../model/runtime/googlenet/googlenet_224x224_nv12.bin,../../model/runtime/mobilenetv2/mobilenetv2_224x224_nv12.bin --input_file=../../data/cls_images/zebra_cls.jpg,../../data/cls_images/zebra_cls.jpg
I0000 00:00:00.000000 10916 vlog_is_on.cc:197] RAW: Set VLOG level for "*" to 3
[BPU_PLAT]BPU Platform Version(1.3.3)!
[HBRT] set log level as 0. version = 3.15.18.0
[DNN] Runtime version = 1.17.2_(3.15.18 HBRT)
[A][DNN][packed_model.cpp:225][Model](2023-04-11,17:57:43.547.52) [HorizonRT] The model builder version = 1.15.0
[A][DNN][packed_model.cpp:225][Model](2023-04-11,17:57:51.811.477) [HorizonRT] The model builder version = 1.15.0
I0411 17:57:51.844280 10916 main.cpp:117] hbDNNInitializeFromFiles success
I0411 17:57:51.844388 10916 main.cpp:125] hbDNNGetModelNameList success
I0411 17:57:51.844424 10916 main.cpp:139] hbDNNGetModelHandle success
I0411 17:57:51.875140 10916 main.cpp:153] read image to nv12 success
I0411 17:57:51.875686 10916 main.cpp:170] prepare input tensor success
I0411 17:57:51.875875 10916 main.cpp:182] prepare output tensor success
I0411 17:57:51.876082 10916 main.cpp:216] infer success
I0411 17:57:51.878844 10916 main.cpp:221] task done
I0411 17:57:51.878948 10916 main.cpp:226] googlenet class result id: 340
I0411 17:57:51.879084 10916 main.cpp:230] mobilenetv2 class result id: 340
I0411 17:57:51.879177 10916 main.cpp:234] release task success

As you can see, the terminal prints out the inference results for both models, and since the inference is on the same picture (zebra_cls.jpg), the inference results are the same.