Batch Model Inference

1 Preface

On medium- and high-compute platforms such as J5, a model with a small input resolution or a small amount of computation often cannot achieve high computational efficiency. Such a model can be deployed in batch mode (batch > 1) so that it runs inference on several images at a time, which improves the compute-to-memory-access ratio. J5 supports compiling batch models by configuring the input_shape or input_batch parameter in the yaml file at model conversion time, and it also supports running inference on batch models during on-board deployment. This article explains the compilation and deployment of batch models in detail.

2 Batch model compilation

Compiling a batch model requires that the yaml parameters input_shape and input_batch be configured correctly. The rules depend on whether the original model has dynamic input or not.

  • Dynamic input model (for example, an input of ?x3x224x224): you must specify the model input information with the input_shape parameter. If input_shape is set to 1x3x224x224 and the model has a single input, you can use the input_batch parameter to compile a multi-batch model. If the first dimension of input_shape is set to an integer greater than 1, the original model itself is treated as a multi-batch model and the input_batch parameter is not available.
  • Non-dynamic input model: if the input shape[0] is 1 and the model has a single input, the input_batch parameter can be used. If the input shape[0] is not 1, the input_batch parameter cannot be used.

In other words, the input_batch parameter can only be used when the model has a single input and the first dimension of input_shape is 1. To compile a multi-input model in batch mode, set the batch size of each input branch when exporting ONNX from the original open source framework (PyTorch, TensorFlow, etc.). The Horizon toolchain supports input branches that share the same batch size as well as branches with different batch sizes.

In addition, for a dynamic input model whose input_shape has a first dimension greater than 1, the shape of the calibration data must match input_shape. In all other cases the calibration data needs no special handling and keeps the shape 1x3x224x224.
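The rule for when input_batch is available can be written down as a small predicate. The sketch below is only an illustration of the rule as stated in this section; the function name can_use_input_batch and its parameters are hypothetical and are not part of the Horizon toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a toolchain API): returns true when the yaml
// parameter input_batch may be used, following the rule described above.
//   num_inputs: number of input branches of the model
//   first_dim:  first dimension of the configured input_shape
bool can_use_input_batch(std::size_t num_inputs, int64_t first_dim) {
  assert(num_inputs >= 1);  // a model has at least one input
  // input_batch requires a single input whose first dimension is exactly 1.
  return num_inputs == 1 && first_dim == 1;
}
```

For example, a single-input model configured as 1x3x224x224 satisfies the rule, while the same model configured as 4x3x224x224, or any multi-input model, does not.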

3 Batch model deployment

3.1 Example introduction

The ddk/samples/ai_toolchain/horizon_runtime_sample directory of the OE package contains a large number of basic examples for on-board deployment. The file structure of the directory is as follows:

+---horizon_runtime_sample
├── code                        
│   ├── 00_quick_start          
│   ├── 01_api_tutorial         
│   ├── 02_advanced_samples   
│   │   ├── custom_identity
│   │   ├── multi_input
│   │   ├── multi_model_batch
│   │   └── nv12_batch
│   ├── 03_misc                 
│   ├── build_j5.sh             
│   ├── build_x86.sh            
│   ├── CMakeLists.txt
│   ├── CMakeLists_x86.txt
│   └── deps_gcc9.3             
├── j5
│   ├── data                    
│   ├── model
│   ├── script                  
│   └── script_x86              
└── README.md

The code folder contains the sample C++ code and build-related files. The j5 folder contains the sample scripts and the executables generated by the build, along with preset data and models. The scripts in the script directory can be run on the development board to execute the corresponding model inference examples. The sample discussed in this article is nv12_batch in the 02_advanced_samples directory. It runs the googlenet_4x224x224_nv12.bin classification model with batch 4 and reads four jpg images for two forward inference passes. The two passes differ only in how memory is allocated, and post-processing produces one group of Top5 classification results for each pass.

Developers should be familiar with the on-board deployment API provided by Horizon before studying the code. It is described in the [BPU SDK API](https://developer.horizon.cc/api/v1/fileData/horizon_j5_open_explorer_cn_doc/runtime/source/bpu_sdk_api/source/index.html) chapter of the toolchain manual. Besides a detailed description of the API, that chapter covers the data types and data structures used in on-board deployment, data layout and alignment rules, error codes, and so on. You can also read the sample code with the API manual open alongside it. Developers who are new to the toolchain are advised to first read the article on [getting started quickly with model inference](https://developer.horizon.cc/forumDetail/174216099150358528), which takes a closer look at the 00_quick_start sample in horizon_runtime_sample. Since the code structure of nv12_batch is similar to that of 00_quick_start, this tutorial focuses on the batch-specific parts.

3.2 Program structure

In the source code, the main steps of the main function are shown in the figure above. The main difference between the two passes is that Infer1 allocates a separate block of BPU memory for each of the four input images, so the input tensors are stored in four discrete blocks, while Infer2 allocates one contiguous block of BPU memory for all four input images and stores them one after another. Both methods are batch model inference, and developers can choose the memory allocation method that fits their scenario. For example, if a batch-4 model deployed on J5 processes the data collected by four cameras, the four images of one frame are stored in four discrete memory blocks, so the Infer1 method is needed. If you are only doing offline replay, reading existing local images for inference, you can use Infer2 to store the four images in one contiguous memory space and then run batch model inference. The key code of Infer1 and Infer2 is analyzed below.

3.3 Infer1 Code interpretation

For Infer1, the core code is the custom functions prepare_tensor_batch_separate and read_image_2_tensor_as_nv12_batch_separate.

int prepare_tensor_batch_separate(std::vector<hbDNNTensor> &input_tensor,
                                  std::vector<hbDNNTensor> &output_tensor,
                                  hbDNNHandle_t dnn_handle) {
  ......                                  
  for (int i = 0; i < input_count; i++) {
  ......
    if (input.properties.tensorType == HB_DNN_IMG_TYPE_NV12) {
      int32_t batch = input.properties.alignedShape.dimensionSize[0];
      int32_t batch_size = input.properties.alignedByteSize / batch;
      //Modify properties batch as 1 and modify alignedByteSize.
      input.properties.alignedByteSize = batch_size;
      input.properties.validShape.dimensionSize[0] = 1;
      input.properties.alignedShape = input.properties.validShape;
      for (int j{0}; j < batch; j++) {
        HB_CHECK_SUCCESS(hbSysAllocCachedMem(&input.sysMem[0], batch_size),
                         "hbSysAllocCachedMem failed");
        input_tensor.push_back(input);
      }
    } else if (input.properties.tensorType == HB_DNN_IMG_TYPE_NV12_SEPARATE) {
    ......
    } else {
    ......
    }
  }
  ......
  return 0;
}

In the function prepare_tensor_batch_separate, taking the nv12 data type as an example, the variable batch is the batch size set when the model was compiled, and the variable batch_size is the aligned byte size of a single image in the batch. Because Infer1 allocates memory separately for each image, two tensor properties must be adjusted: the aligned byte size and the first dimension of the valid shape. input.properties.alignedByteSize is set to batch_size, the aligned byte size of a single input image, and input.properties.validShape.dimensionSize[0] is set to 1. With these settings, the on-board inference library can parse the input tensor information correctly at inference time. The for loop over j then runs batch times, and each iteration calls hbSysAllocCachedMem, the memory allocation interface provided by Horizon, to allocate batch_size bytes for one input tensor. In short, prepare_tensor_batch_separate loops batch times to allocate memory separately for each input image.
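The bookkeeping described above can be tried out off-board with plain host memory. The following is a minimal sketch, not the sample's code: MockProperties and MockTensor are simplified stand-ins for the SDK structures, std::vector replaces the BPU memory returned by hbSysAllocCachedMem, and the byte sizes in the usage note are unpadded NV12 sizes chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-in for the fields of hbDNNTensorProperties used above.
// Illustrative type only; it is not the real SDK structure.
struct MockProperties {
  int32_t batch;              // plays the role of the shape's dimensionSize[0]
  int32_t aligned_byte_size;  // plays the role of alignedByteSize
};

struct MockTensor {
  MockProperties properties;
  std::vector<uint8_t> memory;  // stands in for the memory in sysMem[0]
};

// Turn one batch-N input description into N batch-1 tensors, each with its
// own buffer, following the same steps as prepare_tensor_batch_separate.
std::vector<MockTensor> split_batch_separate(const MockProperties &model_input) {
  assert(model_input.batch > 0);
  assert(model_input.aligned_byte_size % model_input.batch == 0);
  const int32_t batch_size = model_input.aligned_byte_size / model_input.batch;

  std::vector<MockTensor> tensors;
  tensors.reserve(static_cast<std::size_t>(model_input.batch));
  for (int32_t j = 0; j < model_input.batch; ++j) {
    MockTensor t;
    t.properties.batch = 1;                       // first dimension becomes 1
    t.properties.aligned_byte_size = batch_size;  // size of a single image
    t.memory.assign(static_cast<std::size_t>(batch_size), uint8_t{0});
    tensors.push_back(std::move(t));              // one separate buffer per image
  }
  return tensors;
}
```

Calling split_batch_separate({4, 301056}) yields four tensors of 75264 bytes each, which is 224 x 224 x 1.5 bytes for an unpadded NV12 image. On the board, the real alignedByteSize reported by the runtime should be used instead of a hand-computed value.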

int32_t read_image_2_tensor_as_nv12_batch_separate(
    std::string &image_file, std::vector<hbDNNTensor> &input_tensor) {
  ......
  for (int32_t i{0}; i < input_tensor.size(); i++) {
    hbDNNTensor &input = input_tensor[i];
    hbDNNTensorProperties &Properties = input.properties;
    int input_h = Properties.validShape.dimensionSize[2];
    int input_w = Properties.validShape.dimensionSize[3];
    cv::Mat bgr_mat = cv::imread(input_image_file[i], cv::IMREAD_COLOR);
    // convert to YUV420
    // copy y data
    // copy uv data
    ......  
  }
  return 0;
}

The function read_image_2_tensor_as_nv12_batch_separate uses a loop to read the height and width of each tensor in the batch in turn (input_tensor.size() equals batch). Each input image is then copied into its own memory block.

3.4 Infer2 Code interpretation

For Infer2, the core code is the custom functions prepare_tensor_batch_combine and read_image_2_tensor_as_nv12_batch_combine.

int prepare_tensor_batch_combine(std::vector<hbDNNTensor> &input_tensor,
                                 std::vector<hbDNNTensor> &output_tensor,
                                 hbDNNHandle_t dnn_handle) {
  ......
  for (int i = 0; i < input_count; i++) {
    hbDNNTensor input;
    HB_CHECK_SUCCESS(
        hbDNNGetInputTensorProperties(&input.properties, dnn_handle, i),
        "hbDNNGetInputTensorProperties failed");
    HB_CHECK_SUCCESS(
        hbSysAllocCachedMem(&input.sysMem[0], input.properties.alignedByteSize),
        "hbSysAllocCachedMem failed");
    input.properties.alignedShape = input.properties.validShape;
    input_tensor.push_back(input);
  }
  ......
  return 0;
}

The prepare_tensor_batch_combine function allocates one contiguous block of memory for all the input data in the batch in a single call. The variable input.properties.alignedByteSize is the total aligned size, in bytes, that all the input data occupies in memory.

int32_t read_image_2_tensor_as_nv12_batch_combine(
    std::string &image_file, std::vector<hbDNNTensor> &input_tensor) {
  hbDNNTensor &input = input_tensor[0];
  hbDNNTensorProperties &Properties = input.properties;
  int batch = Properties.alignedShape.dimensionSize[0];
  int input_h = Properties.validShape.dimensionSize[2];
  int input_w = Properties.validShape.dimensionSize[3];
  ......
  auto data = reinterpret_cast<uint8_t *>(input.sysMem[0].virAddr);
  auto batch_size = Properties.alignedByteSize / batch;
  for (int32_t i{0}; i < batch; i++) {
    cv::Mat bgr_mat = cv::imread(input_image_file[i], cv::IMREAD_COLOR);
    // convert to YUV420
    // copy y data
    // copy uv data
    ......
    data += batch_size;
  }
  return 0;
}

The function read_image_2_tensor_as_nv12_batch_combine loops batch times and stores the input images one after another in a contiguous memory space. The variable data points to the start address of the current image in memory, and batch_size is the aligned byte size occupied by each image.
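The pointer stepping can be illustrated with ordinary host memory. This is a minimal sketch under stated assumptions: pack_batch_contiguous is a hypothetical helper, std::vector stands in for the BPU memory behind sysMem[0].virAddr, and the images are assumed to be already converted to their final byte layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative helper: copy `images` one after another into a single
// contiguous buffer, advancing the write pointer by `batch_size` bytes per
// image, like `data += batch_size` in the listing above.
// `batch_size` may exceed an image's size when alignment padding is present;
// the padding bytes are left as zero.
std::vector<uint8_t> pack_batch_contiguous(
    const std::vector<std::vector<uint8_t>> &images, std::size_t batch_size) {
  std::vector<uint8_t> buffer(images.size() * batch_size, uint8_t{0});
  uint8_t *data = buffer.data();
  for (const auto &image : images) {
    assert(image.size() <= batch_size);
    if (!image.empty()) {
      std::memcpy(data, image.data(), image.size());
    }
    data += batch_size;  // start address of the next image
  }
  return buffer;
}
```

With two images of 3 and 2 bytes and a batch_size of 4, the result is an 8-byte buffer in which the second image starts at offset 4.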

3.5 Running on the board

Running the batch model inference example on the board is simple. First execute the build_j5.sh script in the code folder; this generates the executables and their dependencies in the j5 folder. Then, on the development board, go to the j5/script/02_advanced_samples directory and run the run_nv12_batch.sh script. On the J5 development board, the example prints the following to the terminal:

I0000 00:00:00.000000 21511 vlog_is_on.cc:197] RAW: Set VLOG level for "*" to 3
[BPU_PLAT]BPU Platform Version(1.3.3)!
[HBRT] set log level as 0. version = 3.15.18.0
[DNN] Runtime version = 1.17.2_(3.15.18 HBRT)
I0705 11:39:43.429180 21511 nv12_batch.cc:151] Infer1 start
I0705 11:39:43.488143 21511 nv12_batch.cc:166] read image to tensor as nv12 success
I0705 11:39:43.491156 21511 nv12_batch.cc:201] Batch[0]:
I0705 11:39:43.491211 21511 nv12_batch.cc:203] TOP 0 result id: 340
I0705 11:39:43.491240 21511 nv12_batch.cc:203] TOP 1 result id: 83
I0705 11:39:43.491266 21511 nv12_batch.cc:203] TOP 2 result id: 41
I0705 11:39:43.491298 21511 nv12_batch.cc:203] TOP 3 result id: 912
I0705 11:39:43.491324 21511 nv12_batch.cc:203] TOP 4 result id: 292
I0705 11:39:43.491348 21511 nv12_batch.cc:201] Batch[1]:
I0705 11:39:43.491374 21511 nv12_batch.cc:203] TOP 0 result id: 282
I0705 11:39:43.491398 21511 nv12_batch.cc:203] TOP 1 result id: 281
I0705 11:39:43.491422 21511 nv12_batch.cc:203] TOP 2 result id: 285
I0705 11:39:43.491447 21511 nv12_batch.cc:203] TOP 3 result id: 287
I0705 11:39:43.491472 21511 nv12_batch.cc:203] TOP 4 result id: 283
I0705 11:39:43.491497 21511 nv12_batch.cc:201] Batch[2]:
I0705 11:39:43.491514 21511 nv12_batch.cc:203] TOP 0 result id: 340
I0705 11:39:43.491539 21511 nv12_batch.cc:203] TOP 1 result id: 83
I0705 11:39:43.491564 21511 nv12_batch.cc:203] TOP 2 result id: 41
I0705 11:39:43.491587 21511 nv12_batch.cc:203] TOP 3 result id: 912
I0705 11:39:43.491612 21511 nv12_batch.cc:203] TOP 4 result id: 292
I0705 11:39:43.491637 21511 nv12_batch.cc:201] Batch[3]:
I0705 11:39:43.491662 21511 nv12_batch.cc:203] TOP 0 result id: 282
I0705 11:39:43.491685 21511 nv12_batch.cc:203] TOP 1 result id: 281
I0705 11:39:43.491710 21511 nv12_batch.cc:203] TOP 2 result id: 285
I0705 11:39:43.491734 21511 nv12_batch.cc:203] TOP 3 result id: 287
I0705 11:39:43.491760 21511 nv12_batch.cc:203] TOP 4 result id: 283
I0705 11:39:43.492235 21511 nv12_batch.cc:223] Infer1 end
I0705 11:39:43.492276 21511 nv12_batch.cc:228] Infer2 start
I0705 11:39:43.549713 21511 nv12_batch.cc:243] read image to tensor as nv12 success
I0705 11:39:43.552248 21511 nv12_batch.cc:278] Batch[0]:
I0705 11:39:43.552292 21511 nv12_batch.cc:280] TOP 0 result id: 340
I0705 11:39:43.552320 21511 nv12_batch.cc:280] TOP 1 result id: 83
I0705 11:39:43.552345 21511 nv12_batch.cc:280] TOP 2 result id: 41
I0705 11:39:43.552371 21511 nv12_batch.cc:280] TOP 3 result id: 912
I0705 11:39:43.552397 21511 nv12_batch.cc:280] TOP 4 result id: 292
I0705 11:39:43.552421 21511 nv12_batch.cc:278] Batch[1]:
I0705 11:39:43.552445 21511 nv12_batch.cc:280] TOP 0 result id: 282
I0705 11:39:43.552469 21511 nv12_batch.cc:280] TOP 1 result id: 281
I0705 11:39:43.552495 21511 nv12_batch.cc:280] TOP 2 result id: 285
I0705 11:39:43.552520 21511 nv12_batch.cc:280] TOP 3 result id: 287
I0705 11:39:43.552567 21511 nv12_batch.cc:280] TOP 4 result id: 283
I0705 11:39:43.552592 21511 nv12_batch.cc:278] Batch[2]:
I0705 11:39:43.552616 21511 nv12_batch.cc:280] TOP 0 result id: 340
I0705 11:39:43.552641 21511 nv12_batch.cc:280] TOP 1 result id: 83
I0705 11:39:43.552665 21511 nv12_batch.cc:280] TOP 2 result id: 41
I0705 11:39:43.552690 21511 nv12_batch.cc:280] TOP 3 result id: 912
I0705 11:39:43.552716 21511 nv12_batch.cc:280] TOP 4 result id: 292
I0705 11:39:43.552739 21511 nv12_batch.cc:278] Batch[3]:
I0705 11:39:43.552763 21511 nv12_batch.cc:280] TOP 0 result id: 282
I0705 11:39:43.552788 21511 nv12_batch.cc:280] TOP 1 result id: 281
I0705 11:39:43.552812 21511 nv12_batch.cc:280] TOP 2 result id: 285
I0705 11:39:43.552837 21511 nv12_batch.cc:280] TOP 3 result id: 287
I0705 11:39:43.552861 21511 nv12_batch.cc:280] TOP 4 result id: 283
I0705 11:39:43.553154 21511 nv12_batch.cc:300] Infer2 end

As you can see, the terminal prints the Top5 classification results for both inference passes, and the two sets of results are identical.

4 Performance comparison

To visualize the performance improvements that batch mode on J5 can bring to small models, several sets of experimental data are provided for reference.

4.1 Experimental conditions

  • J5 system software version: LNX5.10_REL_PL3.0_20221128-161022 release
  • J5 toolchain version: 1.1.49b
  • Model source: horizon_model_convert_sample/03_classification/02_googlenet
  • Model input configuration: NCHW
  • Model input size: 1x3x224x224, 8x3x224x224
  • Performance test tool: hrt_model_exec

4.2 Comparison of BPU Usage of single-core single-thread

Run the perf function of the hrt_model_exec tool in single-core, single-thread mode, and use the hrut_bpuprofile -b 2 -r 0 command to check BPU usage. With batch=8, the single-core BPU occupancy is significantly higher than with batch=1.

4.3 Single-core single-thread Latency comparison

Comparing on the basis of 8 images: with batch=1, the latency for running inference on 8 images is 1.13ms * 8 = 9.04ms, which is significantly longer than the 3.05ms needed with batch=8.

4.4 Dual-core 8-thread FPS comparison

With batch=8, each model inference (one frame) processes 8 images, so the effective FPS is 790*8=6320, far higher than the 2703 of batch=1.
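The per-image arithmetic used in sections 4.3 and 4.4 can be captured in two small helpers. The function names below are hypothetical and serve only to make the normalization explicit; the numbers in the usage note are the measurements quoted in those sections.

```cpp
#include <cassert>

// Total latency (ms) to process `images` images when each inference call
// handles `batch` images and takes `latency_ms` milliseconds.
double total_latency_ms(double latency_ms, int batch, int images) {
  assert(batch > 0 && images > 0 && images % batch == 0);
  return latency_ms * (images / batch);
}

// Images processed per second when the model completes `model_fps`
// inferences per second and each inference handles `batch` images.
double images_per_second(double model_fps, int batch) {
  assert(batch > 0);
  return model_fps * batch;
}
```

total_latency_ms(1.13, 1, 8) gives about 9.04 ms against 3.05 ms for total_latency_ms(3.05, 8, 8), roughly a 3x reduction. images_per_second(790, 8) gives 6320 images per second against 2703 for batch=1, roughly 2.3x.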

4.5 Experimental conclusions

For small models, batch mode improves BPU occupancy and reduces wasted compute, and both latency and FPS are significantly better than with the single-batch model.